[
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ma Jian updated HUDI-7166:
--------------------------
Summary: Provide a Procedure to Calculate Column Stats Overlap Degree
(was: Providing metrics for archive and defining som string constants)
> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
> Key: HUDI-7166
> URL: https://issues.apache.org/jira/browse/HUDI-7166
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ma Jian
> Priority: Major
>
> [HUDI-7110] Add call procedure for show column stats information (#10120)
> added a tool for displaying column stats. However, its output is hard to
> inspect manually on large tables. For instance, with tens of thousands of
> parquet files, column stats can contain on the order of hundreds of
> thousands of rows, which is virtually unreadable to humans and requires
> further processing by code. An administrator who simply wants to inspect
> the data layout from column stats needs a more intuitive display. This
> issue adds a procedure that calculates the overlap degree of column stats
> per partition and column name.
>
> Overlap degree refers to the extent to which the min-max ranges of different
> files intersect with each other. This directly affects the effectiveness of
> data skipping.
>
> Snowflake provides a similar concept to support its clustering process:
> https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions
> Our implementation here is not overly complex.
>
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
> |path|c8|1.33|2|2|1|1|1|1|3|
> This content provides a straightforward representation of the relevant
> statistics.
>
> For example, consider three files: a.parquet, b.parquet, and c.parquet.
> For an integer column 'id', suppose the min-max range is 1–5 for 'a', 3–7
> for 'b', and 7–8 for 'c'. The ranges then overlap on 3–5 (a and b) and at
> 7 (b and c).
> If a filter on 'id' used for data skipping includes these values, multiple
> files match the filter and cannot be skipped. In the simplest case, an
> equality query, a value inside these overlapping ranges matches 2 files,
> while any other value matches at most one file (none if it lies outside
> all ranges).
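
The three-file example above can be sketched in a few lines of Python. This is only an illustration, not the procedure's actual implementation: the names FILE_RANGES, files_to_read and overlap_counts are hypothetical, ranges are assumed to be closed intervals, and "overlap" of a file is assumed here to mean the number of other files whose range intersects it.

```python
from typing import Dict, List, Tuple

# Min-max ranges of column 'id', taken from the example in the description.
FILE_RANGES: Dict[str, Tuple[int, int]] = {
    "a.parquet": (1, 5),
    "b.parquet": (3, 7),
    "c.parquet": (7, 8),
}


def files_to_read(ranges: Dict[str, Tuple[int, int]], value: int) -> List[str]:
    """Files whose closed [min, max] range contains `value`.

    These are the files that cannot be skipped for the query id = value.
    """
    return sorted(name for name, (lo, hi) in ranges.items() if lo <= value <= hi)


def overlap_counts(ranges: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """For each file, count the other files whose closed range intersects it.

    Assumed definition for illustration; the real procedure may differ.
    """
    counts: Dict[str, int] = {}
    for name, (lo, hi) in ranges.items():
        counts[name] = sum(
            1
            for other, (other_lo, other_hi) in ranges.items()
            if other != name and lo <= other_hi and other_lo <= hi
        )
    return counts


if __name__ == "__main__":
    for value in (2, 4, 7, 9):
        print(value, files_to_read(FILE_RANGES, value))
    print(overlap_counts(FILE_RANGES))
```

Under these assumptions, an equality filter on 4 or 7 has to read two files, a filter on 2 reads one, and a filter on 9 reads none, which is why a lower overlap makes data skipping more effective.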
--
This message was sent by Atlassian Jira
(v8.20.10#820010)