[ https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ma Jian updated HUDI-7166:
--------------------------
Description:
In HUDI-7110, a tool was made available to display column stats. However, its
output is hard to inspect manually at large data volumes. For instance, with
tens of thousands of parquet files, column stats can run to hundreds of
thousands of rows. That is virtually unreadable to humans and needs further
processing by code. Yet an administrator who simply wants to observe the data
layout from column stats needs a more intuitive display. Here, we offer a tool
that calculates the overlap degree of column stats per partition and column
name.
Overlap degree refers to the extent to which the min-max ranges of different
files intersect with each other. This directly affects the effectiveness of
data skipping.
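As a minimal sketch of this definition, two files' closed min-max ranges
intersect when each one starts no later than the other ends. The helper names
and the sample ranges below are hypothetical illustrations, not Hudi's
implementation.

```python
from itertools import combinations

def ranges_intersect(a, b):
    """Return True if closed ranges a=(min, max) and b=(min, max) share a value."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlapping_pairs(file_ranges):
    """List the pairs of file names whose min-max ranges intersect."""
    return [
        (x, y)
        for (x, rx), (y, ry) in combinations(sorted(file_ranges.items()), 2)
        if ranges_intersect(rx, ry)
    ]

# Hypothetical per-file (min, max) stats for one column in one partition.
sample = {"f1": (10, 20), "f2": (15, 30), "f3": (40, 50)}
print(overlapping_pairs(sample))
```

The fewer such pairs a column has, the more files a min-max filter on that
column can skip.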
In fact, a similar concept is also provided by Snowflake to aid their
clustering process.
[https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions]
Our implementation here is not overly complex.
It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
|path|c8|1.33|2|2|1|1|1|1|3|
This content provides a straightforward representation of the relevant
statistics.
For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
for 'b' is 3–7, and for 'c' is 7–8. Thus, the ranges overlap on 3–5 (a and b)
and at 7 (b and c).
If a filter on 'id' includes these values, data skipping cannot narrow the scan
to a single file: multiple files match and must be read. In the simple case of
an equality query, 2 files match for values in the overlapping ranges, and at
most one file matches for any other value (none if the value lies outside all
ranges).
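The example can be checked with a short sketch. The function names are
hypothetical, and treating the overlap at a value as the number of files whose
range contains it is an assumption about how the metric is defined, not a
description of the procedure's actual code.

```python
# Min-max ranges of column 'id' for the three files in the example above.
FILES = {"a.parquet": (1, 5), "b.parquet": (3, 7), "c.parquet": (7, 8)}

def candidate_files(value, files=FILES):
    """Files that cannot be skipped for the predicate id = value."""
    return sorted(name for name, (lo, hi) in files.items() if lo <= value <= hi)

def overlap_at_boundaries(files=FILES):
    """Assumed metric: for each distinct min/max value, count the files containing it."""
    points = sorted({v for rng in files.values() for v in rng})
    return {p: len(candidate_files(p, files)) for p in points}

for v in (2, 4, 7, 9):
    print(v, candidate_files(v))

depths = overlap_at_boundaries()
print(depths)
print("max:", max(depths.values()), "average:", sum(depths.values()) / len(depths))
```

With these ranges, equality lookups on 4 or 7 return two candidate files, while
2 returns one and 9 returns none.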
was:
In [HUDI-7110] Add call procedure for show column stats information (#10120), a
tool has been made available to display column stats. However, this tool is not
very user-friendly for manual observation when dealing with large data volumes.
For instance, with tens of thousands of parquet files, the number of rows in
column stats could be of the order of hundreds of thousands. This renders the
data virtually unreadable to humans, necessitating further processing by code.
Yet, if an administrator simply wishes to directly observe the data layout
based on column stats under such circumstances, a more intuitive display tool
is required. Here, we offer a tool that calculates the overlap degree of column
stats based on partition and column name.
Overlap degree refers to the extent to which the min-max ranges of different
files intersect with each other. This directly affects the effectiveness of
data skipping.
In fact, a similar concept is also provided by Snowflake to aid their
clustering process.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions Our
implementation here is not overly complex.
It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
|path|c8|1.33|2|2|1|1|1|1|3|
This content provides a straightforward representation of the relevant
statistics.
For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the
ranges 3–5 and 7.
If the filter conditions for 'id' during data skipping include these values,
multiple files will be filtered out. For a simpler case, if it's an equality
query, 2 files will be filtered within these ranges, and no more than one file
will be filtered in other cases (possibly outside of the range).
> Provide a Procedure to Calculate Column Stats Overlap Degree
> ------------------------------------------------------------
>
> Key: HUDI-7166
> URL: https://issues.apache.org/jira/browse/HUDI-7166
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ma Jian
> Priority: Major
>
> In HUDI-7110, a tool was made available to display column stats. However,
> its output is hard to inspect manually at large data volumes. For instance,
> with tens of thousands of parquet files, column stats can run to hundreds of
> thousands of rows. That is virtually unreadable to humans and needs further
> processing by code. Yet an administrator who simply wants to observe the
> data layout from column stats needs a more intuitive display. Here, we offer
> a tool that calculates the overlap degree of column stats per partition and
> column name.
>
> Overlap degree refers to the extent to which the min-max ranges of different
> files intersect with each other. This directly affects the effectiveness of
> data skipping.
>
> In fact, a similar concept is also provided by Snowflake to aid their
> clustering process.
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions]
> Our implementation here is not overly complex.
>
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
> |path|c8|1.33|2|2|1|1|1|1|3|
> This content provides a straightforward representation of the relevant
> statistics.
>
> For example, consider three files: a.parquet, b.parquet, and c.parquet.
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a'
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, the ranges overlap on 3–5
> (a and b) and at 7 (b and c).
> If a filter on 'id' includes these values, data skipping cannot narrow the
> scan to a single file: multiple files match and must be read. In the simple
> case of an equality query, 2 files match for values in the overlapping
> ranges, and at most one file matches for any other value (none if the value
> lies outside all ranges).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)