[
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen updated HUDI-7166:
-----------------------------
Fix Version/s: 1.0.0
> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
> Key: HUDI-7166
> URL: https://issues.apache.org/jira/browse/HUDI-7166
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ma Jian
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7110 , a tool has been made available to display column stats.
> However, this tool is not very user-friendly for manual observation when
> dealing with large data volumes. For instance, with tens of thousands of
> parquet files, the number of rows in column stats could be of the order of
> hundreds of thousands. This renders the data virtually unreadable to humans,
> necessitating further processing by code. Yet, if an administrator simply
> wishes to directly observe the data layout based on column stats under such
> circumstances, a more intuitive display tool is required. Here, we offer a
> tool that calculates the overlap degree of column stats based on partition
> and column name.
>
> Overlap degree refers to the extent to which the min-max ranges of different
> files intersect with each other. This directly affects the effectiveness of
> data skipping.
>
> In fact, a similar concept is also provided by Snowflake to aid their
> clustering process.
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions]
> Our implementation here is not overly complex.
>
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file
> number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
> |path|c8|1.33|2|2|1|1|1|1|3| |
> This content provides a straightforward representation of the relevant
> statistics.
>
> For example, consider three files: a.parquet, b.parquet, and c.parquet.
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a'
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap
> within the ranges 3–5 and 7.
> If the filter conditions for 'id' during data skipping include these values,
> multiple files will be filtered out. For a simpler case, if it's an equality
> query, 2 files will be filtered within these ranges, and no more than one
> file will be filtered in other cases (possibly outside of the range).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)