[
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960878#comment-16960878
]
Gabor Szadovszky commented on PARQUET-1685:
-------------------------------------------
We implemented a similar feature for column indexes. We were able to do that
because the
[specification|https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach]
allows it.
Unfortunately, we did not say anything like that for the min/max values in the
footer. This means that an implementation might rely on the min/max values
being actual values in the related page/rowgroup and might therefore
implement some logic accordingly.
I am not sure whether truncating the values would cause any trouble in the
Parquet implementations, but it is worth thinking about and might require some
discussion on the dev list.
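To make the concern concrete, here is a minimal sketch of what truncation does
to the bounds. It is not the Iceberg code linked in the issue and not the
parquet-mr implementation. The class and method names (TruncateSketch,
truncateMin, truncateMax) are hypothetical, and the sketch assumes values are
compared as raw bytes in unsigned lexicographic order. A real implementation
for strings would also have to respect UTF-8 character boundaries, which this
one ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: byte-wise truncation of min/max bounds.
public class TruncateSketch {

    // Lower bound: a prefix always sorts at or before the full value.
    static byte[] truncateMin(byte[] min, int maxLen) {
        return min.length <= maxLen ? min : Arrays.copyOf(min, maxLen);
    }

    // Upper bound: cut to maxLen bytes, then increment the last byte that is
    // not 0xFF and drop everything after it, so the result sorts after the
    // full value.
    static byte[] truncateMax(byte[] max, int maxLen) {
        if (max.length <= maxLen) {
            return max;
        }
        for (int i = maxLen - 1; i >= 0; i--) {
            if ((max[i] & 0xFF) != 0xFF) {
                byte[] out = Arrays.copyOf(max, i + 1);
                out[i]++;
                return out;
            }
        }
        // Every byte in the prefix is 0xFF: no shorter upper bound exists,
        // so keep the full value.
        return max;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] min = "apple_pie".getBytes(StandardCharsets.UTF_8);
        byte[] max = "banana_split".getBytes(StandardCharsets.UTF_8);
        // Prints "app" and "bao": valid bounds, but not values in the data.
        System.out.println(new String(truncateMin(min, 3), StandardCharsets.UTF_8));
        System.out.println(new String(truncateMax(max, 3), StandardCharsets.UTF_8));
    }
}
```

The truncated bounds still bracket every value in the rowgroup, so range
pruning stays correct, but neither bound occurs in the data any more. A reader
that, for example, answers a MIN() or MAX() query directly from the footer
statistics would return a wrong result.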
BTW, parquet-mr currently implements a 4k hard limit for statistics, so an
empty Statistics object is written to the footer if the combined size of the
min and max values exceeds this limit. Moreover, after 1.11.0 we will not write
statistics into the page headers, so we are only talking about one Statistics
object per rowgroup. Is it really worth adding the truncation to save the
additional 4k (at maximum) per rowgroup?
> Truncate the stored min and max for String statistics to reduce the footer
> size
> --------------------------------------------------------------------------------
>
> Key: PARQUET-1685
> URL: https://issues.apache.org/jira/browse/PARQUET-1685
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics
> to minimize the metadata size. We can borrow this idea and truncate them in
> Parquet as well to reduce the size of the footer, or even the page header.
> Here is the code in Iceberg:
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].