Ma Jian created HUDI-7166:
-----------------------------
Summary: Providing metrics for archive and defining some string
constants
Key: HUDI-7166
URL: https://issues.apache.org/jira/browse/HUDI-7166
Project: Apache Hudi
Issue Type: New Feature
Reporter: Ma Jian
In [HUDI-7110] Add call procedure for show column stats information (#10120), a
tool has been made available to display column stats. However, this tool is not
very user-friendly for manual observation when dealing with large data volumes.
For instance, with tens of thousands of parquet files, the number of rows in
column stats could be of the order of hundreds of thousands. This renders the
data virtually unreadable to humans, necessitating further processing by code.
Yet, if an administrator simply wishes to directly observe the data layout
based on column stats under such circumstances, a more intuitive display tool
is required. Here, we offer a tool that calculates the overlap degree of column
stats based on partition and column name.
Overlap degree refers to the extent to which the min-max ranges of different
files intersect with each other. This directly affects the effectiveness of
data skipping.
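As a rough illustration of this definition, the sketch below counts, for each file, how many other files have a min-max range that intersects its own, then derives an average and a maximum. The exact formula used by the procedure is not spelled out in this message, so the counting rule, the function names, and the sample ranges are all assumptions made for illustration.

```python
from statistics import mean


def ranges_intersect(a, b):
    """Closed (min, max) ranges intersect when each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_counts(file_ranges):
    """Map each file to the number of *other* files whose range intersects its own."""
    return {
        name: sum(
            1
            for other, other_range in file_ranges.items()
            if other != name and ranges_intersect(rng, other_range)
        )
        for name, rng in file_ranges.items()
    }


# Hypothetical min-max ranges of one column within one partition.
file_ranges = {"f1": (0, 10), "f2": (5, 15), "f3": (20, 30), "f4": (8, 25)}

counts = overlap_counts(file_ranges)
average_overlap = mean(counts.values())
maximum_overlap = max(counts.values())
print(counts, average_overlap, maximum_overlap)
```

A lower average means the ranges are closer to disjoint, so a filter on that column can rule out more files.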
In fact, a similar concept is also provided by Snowflake to aid their
clustering process:
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions
Our implementation here is not overly complex.
It yields output similar to the following:
|Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
|path|c8|1.33|2|2|1|1|1|1|3|
This content provides a straightforward representation of the relevant
statistics.
For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking
an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the
ranges 3–5 and 7.
If the filter condition on 'id' during data skipping touches these values,
more than one file matches and has to be read. In the simplest case, an
equality query on a value inside these ranges matches 2 files, while a value
anywhere else matches at most one file (and none if it lies outside every
range).
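The three-file example can be checked with a short sketch. It lists which files an equality predicate on 'id' would still have to read, based only on the min-max ranges given above. The helper name files_to_read is hypothetical and is not part of Hudi.

```python
# Min-max ranges of column 'id' from the example above.
file_ranges = {"a.parquet": (1, 5), "b.parquet": (3, 7), "c.parquet": (7, 8)}


def files_to_read(file_ranges, value):
    """Return the files whose [min, max] range contains the value (cannot be skipped)."""
    return sorted(name for name, (lo, hi) in file_ranges.items() if lo <= value <= hi)


# Probe one value in each region: a-only, the 3-5 overlap, the overlap at 7, out of range.
candidates = {value: files_to_read(file_ranges, value) for value in (1, 4, 7, 9)}
for value, names in candidates.items():
    print(f"id = {value}: {len(names)} file(s) {names}")
```

Values inside the overlapping regions (3–5 and 7) map to two files, and every other value maps to one file or none.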
--
This message was sent by Atlassian Jira
(v8.20.10#820010)