[
https://issues.apache.org/jira/browse/HUDI-6203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722893#comment-17722893
]
Amrish Lal commented on HUDI-6203:
----------------------------------
PR: https://github.com/apache/hudi/pull/8645
> Add support to standalone utility tool to fetch file size stats for a given
> table w/ optional partition filters
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-6203
> URL: https://issues.apache.org/jira/browse/HUDI-6203
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Amrish Lal
> Priority: Major
>
> Provide file size stats for the latest updates that Hudi is consuming. These
> stats are at table level by default, but specifying
> --enable-partition-stats will also show stats at the partition level. If a
> start date (--start-date parameter) and/or end date (--end-date parameter)
> are specified, stats are based on files that were modified in the half-open
> interval [start date, end date). The --num-days parameter can be used to
> select data files from the last --num-days days. If --start-date is
> specified, --num-days is ignored. If none of the date parameters are set,
> stats are computed over all data files of all partitions in the table. Note
> that date filtering is carried out only if the partition name has the format
> '[column name=]yyyy-M-d' or '[column name=]yyyy/M/d'.
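
To make the date filtering described above concrete, here is a minimal Python sketch. It is not taken from the PR: the helper names (parse_partition_date, resolve_window, in_window) are hypothetical, and two behaviours are assumptions on my part, namely that --num-days counts back from "today" and that partitions whose names carry no recognisable date are left unfiltered.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Tuple

# Partition date layouts named in the description: yyyy-M-d and yyyy/M/d.
_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")


def parse_partition_date(partition: str) -> Optional[date]:
    """Return the date encoded in a partition name, or None if there is none."""
    # Drop the optional "column name=" prefix, e.g. "dt=2023-5-7" -> "2023-5-7".
    value = partition.split("=", 1)[-1]
    for fmt in _FORMATS:
        try:
            # strptime accepts single-digit months and days for %m and %d.
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None


def resolve_window(start: Optional[date], end: Optional[date],
                   num_days: Optional[int],
                   today: date) -> Tuple[Optional[date], Optional[date]]:
    """Apply the precedence rule: an explicit start date overrides num_days."""
    if start is None and num_days is not None:
        # Assumption: "last N days" is measured back from today.
        start = today - timedelta(days=num_days)
    return start, end


def in_window(partition: str, start: Optional[date], end: Optional[date]) -> bool:
    """True if the partition date lies in the half-open interval [start, end)."""
    d = parse_partition_date(partition)
    if d is None:
        # Assumption: partitions without a date in their name are not filtered.
        return True
    if start is not None and d < start:
        return False
    if end is not None and d >= end:
        return False
    return True
```

With start 2023-05-01 and end 2023-05-02, the partition "dt=2023-05-01" is kept and "dt=2023-05-02" is dropped, since the end date is exclusive.
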
> The following stats are produced by this class:
> * Number of files
> * Total table size
> * Minimum file size
> * Maximum file size
> * Average file size
> * Median file size
> * p50 file size
> * p90 file size
> * p95 file size
> * p99 file size
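
For reference, the stats listed above can be computed from a plain list of file sizes roughly as follows. This is a sketch only: the function names are hypothetical, and the nearest-rank percentile used here is an assumption, since the description does not say which percentile method the tool applies.

```python
import math
import statistics
from typing import Dict, List


def percentile(sorted_sizes: List[int], p: float) -> int:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = math.ceil(p * len(sorted_sizes) / 100)
    # Clamp so that very small p still maps to the first element.
    return sorted_sizes[max(rank, 1) - 1]


def file_size_stats(sizes: List[int]) -> Dict[str, float]:
    """Summarise file sizes (in bytes) with the stats named in the issue."""
    if not sizes:
        raise ValueError("no data files to summarise")
    ordered = sorted(sizes)
    return {
        "num_files": len(ordered),
        "total_size": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / len(ordered),
        "median": statistics.median(ordered),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }
```

Note that with nearest-rank percentiles, p50 and the median can differ when the number of files is even: the median averages the two middle values, while p50 picks an actual file size.
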
--
This message was sent by Atlassian Jira
(v8.20.10#820010)