[ https://issues.apache.org/jira/browse/PARQUET-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Blue updated PARQUET-372:
--
Description: If a column is storing very large values, say 2-4 MB, then the
page header's min and max values can also be this large. (was: If a column is
storing very large values, say 2-4 MB, then the page header's min and max
values can also be this large. It is wasteful to keep that much data in a page
header, so we should look at options for decreasing the size required in these
cases.
One idea is to truncate the size of binary data and change the last byte to
0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that isn't
huge. This probably has some problems when the data stores multi-byte
characters in UTF8 so we have to be careful and look into byte-wise sorting and
UTF8.)
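
As a rough illustration of the truncation idea in the earlier description, here is a minimal Python sketch. It is not parquet-mr code: the function name `truncate_stat`, its parameters, and the 16-byte limit are hypothetical, and it assumes statistics are compared byte-wise as unsigned values.

```python
def truncate_stat(value: bytes, max_len: int, is_max: bool) -> bytes:
    """Shorten a binary min/max statistic to at most max_len bytes.

    Hypothetical helper illustrating the idea from the issue text:
    truncate, then set the last kept byte to 0xFF (max) or 0x00 (min).
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(value) <= max_len:
        return value  # already small enough, keep the exact value
    # Keep the first max_len - 1 bytes and overwrite the final byte.
    filler = b"\xff" if is_max else b"\x00"
    return value[: max_len - 1] + filler


big = b"parquet-" * 500_000  # made-up 4 MB column value
lo = truncate_stat(big, 16, is_max=False)
hi = truncate_stat(big, 16, is_max=True)

# Python compares bytes objects lexicographically by unsigned byte value.
print(len(lo), len(hi), lo <= big <= hi)
```

For this sample value the truncated bounds still bracket the original. The sketch does not cover every case: if the overwritten byte of a max value is already 0xFF and more data follows, the result is a prefix of the original and sorts below it, and cutting in the middle of a multi-byte UTF8 character yields bytes that are no longer valid text.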
> Parquet stats can have awkwardly large values
> -
>
> Key: PARQUET-372
> URL: https://issues.apache.org/jira/browse/PARQUET-372
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format, parquet-mr
> Reporter: Ryan Blue
>
> If a column is storing very large values, say 2-4 MB, then the page header's
> min and max values can also be this large.