[
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461333#comment-16461333
]
Tim Armstrong commented on PARQUET-1290:
----------------------------------------
I can take this on if someone will assign it to me.
> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
> Key: PARQUET-1290
> URL: https://issues.apache.org/jira/browse/PARQUET-1290
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Tim Armstrong
> Priority: Major
>
> The Parquet spec isn't clear about the upper bound on run lengths in the
> RLE encoding; see
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> It sounds like, in practice, the major implementations don't support run
> lengths > (2^31 - 1); see
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}}
> to be <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run
> lengths, given that it requires huge numbers of values per page and major
> implementations can't write or read such files without overflowing integers.
> Maybe it would be possible if all the columns in a file were extremely
> compressible, but it seems like in practice most implementations will hit
> page or file size limits before producing a very large run.
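
To make the overflow concern in the description concrete, here is a rough Python sketch of decoding one run header from the RLE/bit-packing hybrid encoding and rejecting runs that would not fit in a signed 32-bit integer. The header layout (a ULEB128 varint whose low bit selects the run type, with bit-packed runs counted in groups of 8 values) follows Encodings.md. The function names, the `max_run` parameter, and its default of 2^31 - 1 are assumptions made for illustration; they are not taken from the spec or from any existing implementation.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def write_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; returns (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def decode_run_header(buf, pos=0, max_run=INT32_MAX):
    """Decode one run header; returns (kind, num_values, next_pos).

    max_run is a hypothetical limit (assumption): the number of values
    a reader is willing to accept in a single run.
    """
    header, pos = read_uleb128(buf, pos)
    count = header >> 1
    if header & 1:
        # bit-packed run: header = (bit-pack-count << 1) | 1,
        # and each group holds 8 values
        kind = "bit-packed"
        num_values = count * 8
    else:
        # RLE run: header = (number of times repeated << 1)
        kind = "rle"
        num_values = count
    if num_values > max_run:
        raise OverflowError(
            "%s run of %d values exceeds limit %d" % (kind, num_values, max_run)
        )
    return kind, num_values, pos
```

Python integers do not overflow, so the explicit comparison against `max_run` stands in for the check that a reader using 32-bit counters would need. Note that a bit-packed run reaches the limit at a much smaller `bit-pack-count`, because each group contributes 8 values.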