[ 
https://issues.apache.org/jira/browse/HUDI-6203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722893#comment-17722893
 ] 

Amrish Lal commented on HUDI-6203:
----------------------------------

PR: https://github.com/apache/hudi/pull/8645

> Add support to standalone utility tool to fetch file size stats for a given 
> table w/ optional partition filters
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6203
>                 URL: https://issues.apache.org/jira/browse/HUDI-6203
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Amrish Lal
>            Priority: Major
>
> Provide file size stats for the latest updates that Hudi is consuming. These 
> stats are at table level by default, but specifying --enable-partition-stats 
> will also show stats at the partition level. If a start date (--start-date 
> parameter) and/or end date (--end-date parameter) are specified, stats are 
> based on files that were modified in the half-open interval [start date, 
> end date). The --num-days parameter can be used to select data files over 
> the last --num-days days. If --start-date is specified, --num-days is 
> ignored. If none of the date parameters are set, stats are computed over 
> all data files of all partitions in the table. Note that date filtering is 
> carried out only if the partition name has the format '[column 
> name=]yyyy-M-d' or '[column name=]yyyy/M/d'.
> The following stats are produced by this class:
>  * Number of files
>  * Total table size
>  * Minimum file size
>  * Maximum file size
>  * Average file size
>  * Median file size
>  * p50 file size
>  * p90 file size
>  * p95 file size
>  * p99 file size
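
For illustration only, the stats listed above could be computed from a list of file sizes roughly as follows. This is a minimal Python sketch, not the code in the PR: the function name file_size_stats, the dictionary keys, and the nearest-rank percentile method are all assumptions, and the actual tool may compute percentiles differently.

```python
import statistics


def file_size_stats(sizes):
    """Summarise a non-empty list of file sizes (in bytes).

    Hypothetical helper: names and percentile method are assumptions,
    not taken from the Hudi utility.
    """
    if not sizes:
        raise ValueError("no files to summarise")
    ordered = sorted(sizes)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile: rank = ceil(p * n / 100), using integer
        # arithmetic to avoid floating-point rounding surprises.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "num_files": n,
        "total_size": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / n,
        "median": statistics.median(ordered),
        "p50": percentile(50),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
    }


if __name__ == "__main__":
    for name, value in file_size_stats([10, 20, 30, 40]).items():
        print(f"{name}: {value}")
```

With the nearest-rank method, p50 and the median can differ when the number of files is even, which is one possible reason to report both.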

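The date filtering rules in the description can be sketched in Python as below. This is a sketch under stated assumptions, not the PR's implementation: the helper names are hypothetical, --num-days is assumed to count back from today, only single-level partition paths are handled, and partitions whose names do not match the date format are assumed to be kept (the description only says that date filtering is not carried out for them).

```python
import re
from datetime import date, timedelta

# Optional "column=" prefix, then yyyy-M-d or yyyy/M/d with one consistent
# separator (the backreference \2 rejects mixed separators).
_PARTITION_DATE = re.compile(r"^(?:[^=/]+=)?(\d{4})([-/])(\d{1,2})\2(\d{1,2})$")


def partition_date(partition_path):
    """Return the date encoded in a partition name, or None if there is none."""
    m = _PARTITION_DATE.match(partition_path)
    if m is None:
        return None
    try:
        return date(int(m.group(1)), int(m.group(3)), int(m.group(4)))
    except ValueError:
        # Looks like a date but is not a valid one, e.g. 2023-13-40.
        return None


def resolve_range(start=None, end=None, num_days=None, today=None):
    """Return (start, end); None means unbounded on that side.

    An explicit start date wins, so num_days is ignored when start is set.
    Counting num_days back from today is an assumption of this sketch.
    """
    if start is None and num_days is not None:
        today = today or date.today()
        start = today - timedelta(days=num_days)
    return start, end


def in_range(partition_path, start, end):
    """Apply the half-open interval [start, end) to a partition name."""
    d = partition_date(partition_path)
    if d is None:
        return True  # assumption: partitions without a date are not filtered out
    if start is not None and d < start:
        return False
    if end is not None and d >= end:
        return False
    return True
```

For example, with start 2023-05-01 and end 2023-05-02, a partition named dt=2023-5-1 is selected while dt=2023-5-2 is not, because the end date is exclusive.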


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
