[
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961177#comment-16961177
]
Gabor Szadovszky commented on PARQUET-1685:
-------------------------------------------
Hi [[email protected]],
I am fine with truncating the min/max values in the footer (for the row
groups). I just wanted to highlight that the possibility of truncation is not
specified, so it may cause trouble for existing implementations.
In 1.11.0+ we always write column indexes and, by default, filter based on
them if they are present. (The column-index-based filtering can be turned off,
but the writing cannot.) Because we have the column indexes, and because we
never used the statistics in the page headers (in parquet-mr and in
parquet-cpp, as far as I know), we are not writing the page header statistics
anymore.
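
To make the compatibility concern concrete, here is a minimal sketch of how
truncated bounds could stay safe for filtering. It is not the Iceberg
UnicodeUtil code nor anything in parquet-mr; the class and method names are
hypothetical, and it works on raw bytes with unsigned lexicographic ordering,
so the shortened max is not guaranteed to be valid UTF-8. The idea is that the
stored min may only move down and the stored max may only move up, so a reader
that treats them as bounds never skips matching data, while a reader that
assumes they are exact values could be misled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncatedBoundsSketch {

    // Lower bound: a prefix of a value always sorts at or before the value.
    static byte[] truncateMin(byte[] min, int length) {
        return min.length <= length ? min : Arrays.copyOf(min, length);
    }

    // Upper bound: cut to `length` bytes, then increment the last byte that
    // is not 0xFF (dropping everything after it) so the result sorts after
    // the original value.
    static byte[] truncateMax(byte[] max, int length) {
        if (max.length <= length) {
            return max;
        }
        for (int i = length - 1; i >= 0; i--) {
            if ((max[i] & 0xFF) != 0xFF) {
                byte[] bound = Arrays.copyOf(max, i + 1);
                bound[i]++;
                return bound;
            }
        }
        return max; // prefix is all 0xFF: no shorter upper bound, keep as is
    }

    // Unsigned lexicographic comparison of byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // A row group can be skipped for `column = value` only if the value lies
    // outside [lower, upper]. With truncated bounds this may keep a row group
    // that holds no match, but it never drops one that does.
    static boolean mightContain(byte[] lower, byte[] upper, byte[] value) {
        return compare(value, lower) >= 0 && compare(value, upper) <= 0;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] lower = truncateMin(utf8("banana"), 3);
        byte[] upper = truncateMax(utf8("cherry"), 3);
        System.out.println(new String(lower, StandardCharsets.UTF_8));
        System.out.println(new String(upper, StandardCharsets.UTF_8));
        System.out.println(mightContain(lower, upper, utf8("cherry")));
        System.out.println(mightContain(lower, upper, utf8("date")));
    }
}
```

Note that the truncated max is a value that never appeared in the data, which
is exactly why an implementation relying on exact min/max could break.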
> Truncate the stored min and max for String statistics to reduce the footer
> size
> --------------------------------------------------------------------------------
>
> Key: PARQUET-1685
> URL: https://issues.apache.org/jira/browse/PARQUET-1685
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics
> to minimize the metadata size. We can borrow this idea and truncate them in
> Parquet as well, to reduce the size of the footer or even the page header.
> Here is the code in Iceberg:
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)