Ma Jian created HUDI-7166:
-----------------------------

             Summary: Providing metrics for archive and defining som string 
constants
                 Key: HUDI-7166
                 URL: https://issues.apache.org/jira/browse/HUDI-7166
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: Ma Jian


In [HUDI-7110] Add call procedure for show column stats information (#10120), a 
tool has been made available to display column stats. However, this tool is not 
very user-friendly for manual observation when dealing with large data volumes. 
For instance, with tens of thousands of parquet files, the number of rows in 
column stats could be of the order of hundreds of thousands. This renders the 
data virtually unreadable to humans, necessitating further processing by code. 
Yet, if an administrator simply wishes to directly observe the data layout 
based on column stats under such circumstances, a more intuitive display tool 
is required. Here, we offer a tool that calculates the overlap degree of column 
stats based on partition and column name.

 

Overlap degree refers to the extent to which the min-max ranges of different 
files intersect with each other. This directly affects the effectiveness of 
data skipping.

 

In fact, a similar concept is also provided by Snowflake to aid their 
clustering process. 
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions Our 
implementation here is not overly complex.

 

It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file 
number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
|path|c8|1.33|2|2|1|1|1|1|3| |

This content provides a straightforward representation of the relevant 
statistics.

 

For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking 
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5, 
for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the 
ranges 3–5 and 7.

If the filter conditions for 'id' during data skipping include these values, 
multiple files will be filtered out. For a simpler case, if it's an equality 
query, 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to