[ 
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7166:
---------------------------------
    Labels: pull-request-available  (was: )

> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
>                 Key: HUDI-7166
>                 URL: https://issues.apache.org/jira/browse/HUDI-7166
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ma Jian
>            Priority: Major
>              Labels: pull-request-available
>
> In HUDI-7110 , a tool has been made available to display column stats. 
> However, this tool is not very user-friendly for manual observation when 
> dealing with large data volumes. For instance, with tens of thousands of 
> parquet files, the number of rows in column stats could be of the order of 
> hundreds of thousands. This renders the data virtually unreadable to humans, 
> necessitating further processing by code. Yet, if an administrator simply 
> wishes to directly observe the data layout based on column stats under such 
> circumstances, a more intuitive display tool is required. Here, we offer a 
> tool that calculates the overlap degree of column stats based on partition 
> and column name.
>  
> Overlap degree refers to the extent to which the min-max ranges of different 
> files intersect with each other. This directly affects the effectiveness of 
> data skipping.
>  
> In fact, a similar concept is also provided by Snowflake to aid their 
> clustering process. 
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] 
> Our implementation here is not overly complex.
>  
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file 
> number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
> |path|c8|1.33|2|2|1|1|1|1|3| |
> This content provides a straightforward representation of the relevant 
> statistics.
>  
> For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap 
> within the ranges 3–5 and 7.
> If the filter conditions for 'id' during data skipping include these values, 
> multiple files will be filtered out. For a simpler case, if it's an equality 
> query, 2 files will be filtered within these ranges, and no more than one 
> file will be filtered in other cases (possibly outside of the range).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to