[ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461345#comment-16461345
 ] 

ASF GitHub Bot commented on PARQUET-1290:
-----------------------------------------

timarmstrong opened a new pull request #96: PARQUET-1290: clarify run lengths 
for RLE encoding
URL: https://github.com/apache/parquet-format/pull/96
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
>                 Key: PARQUET-1290
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Tim Armstrong
>            Priority: Major
>
> The Parquet spec isn't clear about the upper bound on run lengths in the
> RLE encoding; see
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> It sounds like, in practice, the major implementations don't support run
> lengths > (2^31 - 1); see
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}}
> to be <= 2^31.
> It seems unlikely that there are Parquet files in existence with larger run
> lengths, given that such runs require huge numbers of values per page and
> major implementations can't write or read such files without overflowing
> integers. Maybe it would be possible if all the columns in a file were
> extremely compressible, but in practice most implementations will hit page
> or file size limits before producing a very large run.
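
To make the proposal above concrete, here is a minimal sketch of how a reader might enforce such a bound when decoding a run header of the RLE/bit-packing hybrid. It assumes the header layout from Encodings.md (a ULEB128 varint whose low bit selects the run type and whose remaining bits hold either {{number of times repeated}} or {{bit-pack-count}}). The names read_uleb128, encode_uleb128, read_run_header and MAX_RUN_LENGTH are hypothetical, and the 2^31 limit is simply the value proposed in the issue text, not confirmed spec wording.

```python
# Limit proposed in the issue text; an assumption, not final spec wording.
MAX_RUN_LENGTH = 2**31


def encode_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def read_run_header(buf, pos=0, max_run_length=MAX_RUN_LENGTH):
    """Decode one run header; return (kind, count, next_pos).

    kind is "bit-packed" (count is bit-pack-count, i.e. groups of 8 values)
    or "rle" (count is the number of times the value is repeated).
    """
    header, pos = read_uleb128(buf, pos)
    kind = "bit-packed" if header & 1 else "rle"
    count = header >> 1
    if count > max_run_length:
        raise ValueError(f"{kind} run count {count} exceeds {max_run_length}")
    return kind, count, pos
```

Python integers do not overflow, so the explicit check stands in for the 32-bit overflow that the issue describes in other implementations. A reader limited to signed 32-bit counts could pass max_run_length=2**31 - 1 instead.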



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
