[
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961165#comment-16961165
]
Xinli Shang commented on PARQUET-1685:
--------------------------------------
Hi [~gszadovszky], thanks for your reply!
Regarding "an implementation might rely on the fact that the min/max values are
actual values": was this already discussed earlier, when statistics truncation
was implemented for the 'column index'? I would like to add [~rdblue], who may
already have discussed and thought about this, because it is implemented in
Iceberg.
For the 4k hard limit, I am thinking about it the other way around. If empty
statistics are written because the statistics are oversized, queries become
inefficient, since readers cannot use min/max to skip data. If truncating can
reduce the size and thereby reduce the number of files with empty statistics,
then it is a big win.
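To make the truncation idea concrete, here is a rough sketch. It is not the
Iceberg or parquet-mr code; the class and method names are made up for
illustration, and it assumes values are compared in Unicode code point order
(which matches unsigned byte order of the UTF-8 encoding). The truncated min is
a plain prefix, which always sorts at or below the original. The truncated max
is a prefix whose last code point is incremented, so it sorts above the
original. Neither result is guaranteed to be an actual value in the column.

```java
// Hypothetical helper, for illustration only.
final class TruncatedStringStats {

    // Lower bound: a prefix of s never sorts above s.
    static String truncateMin(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s; // already short enough, keep the real value
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    // Upper bound: take a prefix and increment its last code point.
    // Returns null when no shorter upper bound exists; the caller would
    // then keep the original value or write no max at all.
    static String truncateMax(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        int[] cps = s.codePoints().limit(maxCodePoints).toArray();
        for (int i = cps.length - 1; i >= 0; i--) {
            int next = cps[i] + 1;
            if (next == Character.MIN_SURROGATE) {
                // Skip the surrogate range, which holds no valid code points.
                next = Character.MAX_SURROGATE + 1;
            }
            if (next <= Character.MAX_CODE_POINT) {
                cps[i] = next;
                // Drop everything after position i; the result is still > s.
                return new String(cps, 0, i + 1);
            }
            // cps[i] was the largest code point; try the previous position.
        }
        return null;
    }
}
```

For example, with a limit of 4 code points, "parquet-format" would be stored as
min "parq" and max "parr". A reader can still skip a file or page safely with
these bounds, but it can no longer assume that the min/max are real values.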
In 1.11.0+, is use of the 'column index' enforced, with page statistics no
longer written?
> Truncate the stored min and max for String statistics to reduce the footer
> size
> --------------------------------------------------------------------------------
>
> Key: PARQUET-1685
> URL: https://issues.apache.org/jira/browse/PARQUET-1685
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics to
> minimize the metadata size. We can borrow this approach and truncate them in
> Parquet as well, to reduce the size of the footer, or even the page header.
> Here is the code in Iceberg:
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)