Tim Armstrong created PARQUET-1290:
--------------------------------------
Summary: Clarify maximum run lengths for RLE encoding
Key: PARQUET-1290
URL: https://issues.apache.org/jira/browse/PARQUET-1290
Project: Parquet
Issue Type: Improvement
Components: parquet-format
Reporter: Tim Armstrong
The Parquet spec isn't clear about the upper bound on run lengths in the
RLE encoding:
https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
In practice, it sounds like the major implementations don't support run
lengths > (2^31 - 1); see
https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
I propose that we limit {{bit-pack-count}} and {{number of times repeated}} to
be <= 2^31.
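To make the proposed limit concrete, here is a minimal sketch of how a reader
might enforce it when parsing a run header of the RLE/bit-packing hybrid. It
assumes the header layout from Encodings.md (a ULEB128 varint whose lowest bit
selects a bit-packed run and whose remaining bits hold the count). The names
read_uleb128, read_run_header and PROPOSED_MAX_COUNT are hypothetical and not
taken from any existing implementation.

```python
# Hypothetical constant: the bound proposed in this issue. A reader that keeps
# counts in a signed 32-bit int would instead pass max_count=2**31 - 1.
PROPOSED_MAX_COUNT = 2**31


def read_uleb128(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def read_run_header(buf: bytes, pos: int = 0,
                    max_count: int = PROPOSED_MAX_COUNT) -> tuple[str, int, int]:
    """Parse one run header; return (kind, count, next position).

    kind is "bit-packed" (count is bit-pack-count, i.e. groups of 8 values)
    or "rle" (count is the number of times the value is repeated).
    """
    header, pos = read_uleb128(buf, pos)
    kind = "bit-packed" if header & 1 else "rle"
    count = header >> 1
    if count > max_count:
        # Reject rather than risk overflowing a fixed-width counter later.
        raise ValueError(f"{kind} count {count} exceeds limit {max_count}")
    return kind, count, pos
```

For example, read_run_header(b"\x0a") yields an RLE run with 5 repetitions,
while a header encoding a count of 2^31 + 1 is rejected. The limit is a
parameter so that the stricter 2^31 - 1 bound can be checked the same way.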
It seems unlikely that there are Parquet files in existence with larger run
lengths, given that it requires huge numbers of values per page and major
implementations can't write or read such files without overflowing integers.
It might be possible if all the columns in a file were extremely
compressible, but in practice most implementations will likely hit page
or file size limits before producing a very large run.