Ryan Blue created PARQUET-372:
---------------------------------
Summary: Parquet stats can have awkwardly large values
Key: PARQUET-372
URL: https://issues.apache.org/jira/browse/PARQUET-372
Project: Parquet
Issue Type: Bug
Components: parquet-format, parquet-mr
Reporter: Ryan Blue
If a column is storing very large values, say 2-4 MB, then the page header's
min and max values can also be this large. It is wasteful to keep that much
data in a page header, so we should look at options for decreasing the size
required in these cases.
One idea is to truncate long binary values and change the last remaining byte
to 0xFF (max) or 0x00 (min) to get a roughly equivalent min and max that aren't
huge. This probably runs into problems when the data stores multi-byte UTF-8
characters, so we have to be careful and look into how byte-wise sorting
interacts with UTF-8.
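
A minimal sketch of the truncation idea, in Python for brevity. The function
names and the `limit` parameter are hypothetical and not part of parquet-mr or
parquet-format; the sketch assumes values are compared as unsigned bytes in
lexicographic order, which is how Python compares `bytes`.

```python
def truncate_min(value: bytes, limit: int) -> bytes:
    """Shorten a min value to `limit` bytes, forcing the last kept byte to 0x00."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(value) <= limit:
        return value  # already small enough, keep the exact value
    return value[:limit - 1] + b"\x00"


def truncate_max(value: bytes, limit: int) -> bytes:
    """Shorten a max value to `limit` bytes, forcing the last kept byte to 0xFF."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(value) <= limit:
        return value  # already small enough, keep the exact value
    # Note: 0xFF never occurs in valid UTF-8, so the result may not decode.
    return value[:limit - 1] + b"\xff"


value = b"apple pie"
low = truncate_min(value, 4)   # b"app\x00"
high = truncate_max(value, 4)  # b"app\xff"
assert low <= value <= high
```

One case the sketch does not handle: if the byte being overwritten in the max
is already 0xFF, the shortened max is a prefix of the original and sorts below
it, so it is no longer a valid upper bound. That case would need separate
treatment.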