[
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466114#comment-16466114
]
ASF GitHub Bot commented on PARQUET-1290:
-----------------------------------------
timarmstrong commented on issue #96: PARQUET-1290: clarify run lengths for RLE
encoding
URL: https://github.com/apache/parquet-format/pull/96#issuecomment-387124624
Any more comments? Can this be merged?
> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
> Key: PARQUET-1290
> URL: https://issues.apache.org/jira/browse/PARQUET-1290
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
>
> The Parquet spec isn't clear about the upper bound on run lengths in the
> RLE encoding; see
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> In practice, it sounds like the major implementations don't support run
> lengths greater than 2^31 - 1; see
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}}
> to at most 2^31 - 1, so that both fit in a signed 32-bit integer.
> It seems unlikely that any Parquet files exist with larger run lengths:
> that would require a huge number of values per page, and the major
> implementations can't write or read such files without overflowing integers.
> It might be possible if all the columns in a file were extremely
> compressible, but in practice most implementations will hit page or file
> size limits before producing a run that large.
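To make the proposed bound concrete, here is a minimal Python sketch of a reader that rejects oversized runs. It assumes the hybrid encoding's run header is an unsigned LEB128 varint whose low bit selects the run type and whose remaining bits carry the count, and it assumes a cap of 2^31 - 1, the largest run length the issue says implementations support. The names `MAX_RUN_LENGTH`, `read_uleb128` and `read_run_header` are hypothetical and not taken from any Parquet implementation.

```python
# Assumed cap: the largest value a signed 32-bit integer can hold.
MAX_RUN_LENGTH = 2**31 - 1


def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 varint from buf starting at pos.

    Returns (value, next_pos).
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry payload; the high bit marks continuation.
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def read_run_header(buf, pos=0):
    """Hypothetical helper: parse one run header and enforce the cap.

    Returns (kind, count, next_pos). For an "rle" run, count is the number
    of times the value is repeated; for a "bit-packed" run, count is the
    bit-pack-count stored in the header.
    """
    header, pos = read_uleb128(buf, pos)
    kind = "bit-packed" if header & 1 else "rle"
    count = header >> 1
    if count > MAX_RUN_LENGTH:
        raise ValueError(f"{kind} run count {count} exceeds {MAX_RUN_LENGTH}")
    return kind, count, pos
```

For example, `read_run_header(bytes([0x14]))` parses the header value 20 as an RLE run repeated 10 times. Python integers do not overflow, so the check here is explicit; in a language with fixed-width integers the same check would need to happen before the count is narrowed to 32 bits.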