Ryan Blue created PARQUET-372:
---------------------------------

             Summary: Parquet stats can have awkwardly large values
                 Key: PARQUET-372
                 URL: https://issues.apache.org/jira/browse/PARQUET-372
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format, parquet-mr
            Reporter: Ryan Blue


If a column stores very large values, say 2-4 MB each, then the min and max 
values in the page header can be just as large. It is wasteful to keep that 
much data in a page header, so we should look at options for reducing the 
size required in these cases.

One idea is to truncate binary values and change the last byte to 0xFF (for 
max) or 0x00 (for min) to get a roughly equivalent min and max that aren't 
huge. This probably has some problems when the data stores multi-byte 
characters in UTF-8, so we have to be careful and look into how byte-wise 
sorting interacts with UTF-8.
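
As a rough illustration of the truncation idea above, here is a minimal 
Python sketch. The helper names (truncate_min, truncate_max) and the 16-byte 
limit are made up for this example and are not part of parquet-mr or 
parquet-format. It assumes values are compared as unsigned bytes, 
lexicographically, which is how Python compares bytes objects.

```python
def truncate_min(value: bytes, length: int) -> bytes:
    """Hypothetical helper: shorten a min value to at most `length` bytes."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if len(value) <= length:
        return value  # already small enough, keep the exact value
    # Keep the first length-1 bytes and force the last byte to 0x00.
    return value[:length - 1] + b"\x00"


def truncate_max(value: bytes, length: int) -> bytes:
    """Hypothetical helper: shorten a max value to at most `length` bytes."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if len(value) <= length:
        return value
    # Keep the first length-1 bytes and force the last byte to 0xFF.
    return value[:length - 1] + b"\xff"


# Stand-ins for very large column values.
min_value = b"aardvark" * 1000
max_value = b"zebra" * 1000

short_min = truncate_min(min_value, 16)
short_max = truncate_max(max_value, 16)

# The shortened stats still bracket the real values in byte-wise order.
print(len(short_min), short_min <= min_value)
print(len(short_max), short_max >= max_value)
```

One case this sketch does not handle: if the byte being overwritten in the 
max is already 0xFF and the original value is longer, the result is a prefix 
of the original and sorts below it, so it is no longer an upper bound. The 
overwritten byte can also land in the middle of a multi-byte UTF-8 sequence, 
leaving a value that is not valid UTF-8.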



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
