[ 
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ma Jian updated HUDI-7166:
--------------------------
    Summary: Provide a Procedure to Calculate Column Stats Overlap Degree  
(was: Providing metrics for archive and defining som string constants)

> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
>                 Key: HUDI-7166
>                 URL: https://issues.apache.org/jira/browse/HUDI-7166
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ma Jian
>            Priority: Major
>
> In [HUDI-7110] Add call procedure for show column stats information (#10120), 
> a tool was made available to display column stats. However, its output is 
> hard to inspect manually at large data volumes. For instance, with tens of 
> thousands of parquet files, column stats can contain on the order of hundreds 
> of thousands of rows. This makes the data virtually unreadable to humans 
> without further processing by code. An administrator who simply wants to 
> inspect the data layout based on column stats needs a more intuitive display 
> tool. Here, we offer a tool that calculates the overlap degree of column 
> stats by partition and column name.
>  
> Overlap degree refers to the extent to which the min-max ranges of different 
> files intersect with each other. This directly affects the effectiveness of 
> data skipping.
>  
> In fact, a similar concept is also provided by Snowflake to aid their 
> clustering process. 
> https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions 
> Our implementation here is not overly complex.
>  
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
> |path|c8|1.33|2|2|1|1|1|1|3|
> This content provides a straightforward representation of the relevant 
> statistics.
>  
> For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. The ranges therefore overlap 
> over 3–5 and at the value 7.
> If a filter condition on 'id' touches these values, data skipping cannot 
> narrow the scan to a single file, and multiple files must be read. In the 
> simplest case, an equality query, 2 files match for a value inside these 
> overlapping ranges, and at most one file matches otherwise (none if the 
> value lies outside every range).
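
The three-file example above can be sketched in a few lines of Python. This is only an illustration of the overlap idea, not the procedure's actual implementation: the names `FILE_RANGES`, `ranges_overlap`, `files_to_read`, and `point_overlaps` are hypothetical, and counting overlap at each distinct min/max boundary value is an assumption about how such a summary might be derived.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]

# Hypothetical per-file (min, max) stats for the 'id' column,
# copied from the example in the issue description.
FILE_RANGES: Dict[str, Range] = {
    "a.parquet": (1, 5),
    "b.parquet": (3, 7),
    "c.parquet": (7, 8),
}


def ranges_overlap(r1: Range, r2: Range) -> bool:
    """Two closed ranges intersect when each starts no later than the other ends."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def files_to_read(value: int, ranges: Dict[str, Range]) -> List[str]:
    """Files that an equality filter `id = value` cannot skip."""
    return sorted(name for name, (lo, hi) in ranges.items() if lo <= value <= hi)


def point_overlaps(ranges: Dict[str, Range]) -> Dict[int, int]:
    """Assumed summary: number of files covering each distinct min/max value."""
    points = sorted({v for r in ranges.values() for v in r})
    return {p: len(files_to_read(p, ranges)) for p in points}


if __name__ == "__main__":
    print(files_to_read(4, FILE_RANGES))  # value inside the 3-5 overlap
    print(files_to_read(2, FILE_RANGES))  # value covered by one file only
    overlaps = point_overlaps(FILE_RANGES)
    print(overlaps)
    print(sum(overlaps.values()) / len(overlaps))  # average over boundary values
```

Under these assumptions, a value in 3–5 or equal to 7 maps to two files, while any other value maps to at most one, which matches the equality-query case described above.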



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
