[jira] [Updated] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree

Ma Jian (Jira) Thu, 30 Nov 2023 22:32:59 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ma Jian updated HUDI-7166:
--------------------------
    Description: 
In HUDI-7110 , a tool has been made available to display column stats. However, 
this tool is not very user-friendly for manual observation when dealing with 
large data volumes. For instance, with tens of thousands of parquet files, the 
number of rows in column stats could be of the order of hundreds of thousands. 
This renders the data virtually unreadable to humans, necessitating further 
processing by code. Yet, if an administrator simply wishes to directly observe 
the data layout based on column stats under such circumstances, a more 
intuitive display tool is required. Here, we offer a tool that calculates the 
overlap degree of column stats based on partition and column name.

 

Overlap degree refers to the extent to which the min-max ranges of different 
files intersect with each other. This directly affects the effectiveness of 
data skipping.

 

In fact, a similar concept is also provided by Snowflake to aid their 
clustering process. 
[https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] 
Our implementation here is not overly complex.

 

It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file 
number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
|path|c8|1.33|2|2|1|1|1|1|3| |

This content provides a straightforward representation of the relevant 
statistics.

 

For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking 
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5, 
for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the 
ranges 3–5 and 7.

If the filter conditions for 'id' during data skipping include these values, 
multiple files will be filtered out. For a simpler case, if it's an equality 
query, 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).

  was:
In [HUDI-7110] Add call procedure for show column stats information (#10120), a 
tool has been made available to display column stats. However, this tool is not 
very user-friendly for manual observation when dealing with large data volumes. 
For instance, with tens of thousands of parquet files, the number of rows in 
column stats could be of the order of hundreds of thousands. This renders the 
data virtually unreadable to humans, necessitating further processing by code. 
Yet, if an administrator simply wishes to directly observe the data layout 
based on column stats under such circumstances, a more intuitive display tool 
is required. Here, we offer a tool that calculates the overlap degree of column 
stats based on partition and column name.

 

Overlap degree refers to the extent to which the min-max ranges of different 
files intersect with each other. This directly affects the effectiveness of 
data skipping.

 

In fact, a similar concept is also provided by Snowflake to aid their 
clustering process. 
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions Our 
implementation here is not overly complex.

 

It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file 
number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
|path|c8|1.33|2|2|1|1|1|1|3| |

This content provides a straightforward representation of the relevant 
statistics.

 

For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking 
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5, 
for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the 
ranges 3–5 and 7.

If the filter conditions for 'id' during data skipping include these values, 
multiple files will be filtered out. For a simpler case, if it's an equality 
query, 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).


> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
>                 Key: HUDI-7166
>                 URL: https://issues.apache.org/jira/browse/HUDI-7166
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ma Jian
>            Priority: Major
>
> In HUDI-7110 , a tool has been made available to display column stats. 
> However, this tool is not very user-friendly for manual observation when 
> dealing with large data volumes. For instance, with tens of thousands of 
> parquet files, the number of rows in column stats could be of the order of 
> hundreds of thousands. This renders the data virtually unreadable to humans, 
> necessitating further processing by code. Yet, if an administrator simply 
> wishes to directly observe the data layout based on column stats under such 
> circumstances, a more intuitive display tool is required. Here, we offer a 
> tool that calculates the overlap degree of column stats based on partition 
> and column name.
>  
> Overlap degree refers to the extent to which the min-max ranges of different 
> files intersect with each other. This directly affects the effectiveness of 
> data skipping.
>  
> In fact, a similar concept is also provided by Snowflake to aid their 
> clustering process. 
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] 
> Our implementation here is not overly complex.
>  
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file 
> number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
> |path|c8|1.33|2|2|1|1|1|1|3| |
> This content provides a straightforward representation of the relevant 
> statistics.
>  
> For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap 
> within the ranges 3–5 and 7.
> If the filter conditions for 'id' during data skipping include these values, 
> multiple files will be filtered out. For a simpler case, if it's an equality 
> query, 2 files will be filtered within these ranges, and no more than one 
> file will be filtered in other cases (possibly outside of the range).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree

Reply via email to