Tim Armstrong created PARQUET-1290:
--------------------------------------

             Summary: Clarify maximum run lengths for RLE encoding
                 Key: PARQUET-1290
                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-format
            Reporter: Tim Armstrong


The Parquet spec does not state an upper bound on run lengths in the RLE 
encoding: 
https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
In practice, it sounds like the major implementations don't support run 
lengths > (2^31 - 1); see 
https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E

I propose that we limit {{bit-pack-count}} and {{number of times repeated}} to 
be <= 2^31.
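
As a rough illustration of what a reader-side check for this limit could look like, here is a minimal Python sketch. It assumes the header layout described in Encodings.md (a ULEB128 varint whose low bit selects bit-packed vs. RLE and whose remaining bits hold {{bit-pack-count}} or {{number of times repeated}}). The names MAX_RUN_LENGTH, encode_uleb128, read_uleb128 and parse_run_header are hypothetical and not taken from any implementation, and the inclusive 2^31 bound is only the value proposed here, not something the spec currently states.

```python
# Proposed upper bound from this issue (assumption: inclusive, not yet in the spec).
MAX_RUN_LENGTH = 2**31


def encode_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def parse_run_header(buf, pos=0):
    """Parse one hybrid RLE/bit-packed run header and enforce the proposed limit.

    Returns (kind, count, next_pos). For "rle", count is the number of times
    the value is repeated; for "bit-packed", count is bit-pack-count, i.e. the
    number of groups of 8 values.
    """
    header, pos = read_uleb128(buf, pos)
    kind = "bit-packed" if header & 1 else "rle"
    count = header >> 1
    if count > MAX_RUN_LENGTH:
        raise ValueError(f"{kind} run count {count} exceeds {MAX_RUN_LENGTH}")
    return kind, count, pos
```

Python integers do not overflow, so the check is explicit; an implementation that stores the count in a 32-bit signed integer would need to validate the value before narrowing it.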

It seems unlikely that any Parquet files exist with larger run 
lengths, given that such a run requires a huge number of values per page and major 
implementations can't write or read such files without overflowing integers. 
It might be possible if all the columns in a file were extremely 
compressible, but in practice most implementations will likely hit page 
or file size limits before producing a very large run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)