[ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461333#comment-16461333
 ] 

Tim Armstrong commented on PARQUET-1290:
----------------------------------------

I can take this on if someone will assign it to me.

> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
>                 Key: PARQUET-1290
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Tim Armstrong
>            Priority: Major
>
> The Parquet spec isn't clear about what the upper bound on run lengths in the 
> RLE encoding is: 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> It sounds like, in practice, the major implementations don't support run 
> lengths > (2^31 - 1); see 
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} 
> to being <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run 
> lengths, given that it requires huge numbers of values per page and major 
> implementations can't write or read such files without overflowing integers. 
> Maybe it would be possible if all the columns in a file were extremely 
> compressible, but it seems like in practice most implementations will hit 
> page or file size limits before producing a very-large run.
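
To make the proposed limit concrete, here is a minimal sketch of how a reader might enforce it when decoding a run header. It is not taken from any existing implementation. It assumes the header is an unsigned LEB128 varint whose low bit selects bit-packed (1) or RLE (0) and whose remaining bits hold the count field ({{bit-pack-count}} or {{number of times repeated}}). The names read_uleb128, write_uleb128, read_run_header and MAX_RUN_LENGTH are hypothetical, and the default bound of 2^31 - 1 is an assumption based on the signed 32-bit limit mentioned in the description; the proposed <= 2^31 bound can be passed as max_count instead.

```python
# Assumption: the largest count a signed 32-bit integer can hold.
MAX_RUN_LENGTH = 2**31 - 1


def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint from buf; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def write_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_run_header(buf, pos=0, max_count=MAX_RUN_LENGTH):
    """Decode one run header; return (is_bit_packed, count, next_pos).

    Raises ValueError if the count field exceeds max_count, rather than
    letting it overflow a fixed-width integer further down the read path.
    """
    header, pos = read_uleb128(buf, pos)
    is_bit_packed = bool(header & 1)  # low bit: 1 = bit-packed, 0 = RLE
    count = header >> 1               # remaining bits: the count field
    if count > max_count:
        raise ValueError(f"run count {count} exceeds limit {max_count}")
    return is_bit_packed, count, pos
```

With this sketch, read_run_header(write_uleb128(8 << 1)) decodes an RLE run with a count of 8, while a header encoding a count of 2^31 is rejected under the default limit. Python integers do not overflow, so the explicit check stands in for the wraparound a 32-bit implementation would otherwise hit.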



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
