[ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462012#comment-16462012
 ] 

ASF GitHub Bot commented on PARQUET-1290:
-----------------------------------------

xhochy commented on a change in pull request #96: PARQUET-1290: clarify run lengths for RLE encoding
URL: https://github.com/apache/parquet-format/pull/96#discussion_r185707300
 
 

 ##########
 File path: Encodings.md
 ##########
 @@ -72,15 +72,16 @@ length := length of the <encoded-data> in bytes stored as 4 bytes little endian
 encoded-data := <run>*
 run := <bit-packed-run> | <rle-run>
 bit-packed-run := <bit-packed-header> <bit-packed-values>
-bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)
+bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)
 // we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
-bit-pack-count := (number of values in this run) / 8
+bit-pack-scaled-run-len := (bit-packed-run-len) / 8
+bit-packed-run-len := *see 3 below*
 bit-packed-values := *see 1 below*
 rle-run := <rle-header> <repeated-value>
-rle-header := varint-encode( (number of times repeated) << 1)
+rle-header := varint-encode( (rle-run-len) << 1)
+rle-run-len := *see 3 below*
 repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
 ```
-
 
 Review comment:
   Can you re-add this blank line?
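
For readers following the grammar in the hunk above, here is a minimal Python sketch of how the two run headers could be produced. It assumes that varint-encode is unsigned LEB128 (ULEB128); the function names varint_encode, bit_packed_header and rle_header are hypothetical and are not taken from any Parquet implementation.

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (assumed meaning of varint-encode)."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits of the remaining value
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)
            return bytes(out)


def bit_packed_header(run_len: int) -> bytes:
    """Header for a bit-packed run: varint-encode(<run-len / 8> << 1 | 1)."""
    if run_len <= 0 or run_len % 8 != 0:
        raise ValueError("bit-packed runs hold a positive multiple of 8 values")
    scaled_run_len = run_len // 8
    return varint_encode((scaled_run_len << 1) | 1)


def rle_header(run_len: int) -> bytes:
    """Header for an RLE run: varint-encode(<run-len> << 1)."""
    return varint_encode(run_len << 1)
```

The low bit of the decoded header distinguishes the two run types: 1 for a bit-packed run, 0 for an RLE run.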

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
>                 Key: PARQUET-1290
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Tim Armstrong
>            Priority: Major
>
> The Parquet spec isn't clear about the upper bound on run lengths in the 
> RLE encoding: 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> In practice, the major implementations apparently don't support run 
> lengths > (2^31 - 1); see 
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} 
> to being <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run 
> lengths, given that it requires huge numbers of values per page and major 
> implementations can't write or read such files without overflowing integers. 
> Maybe it would be possible if all the columns in a file were extremely 
> compressible, but it seems like in practice most implementations will hit 
> page or file size limits before producing a very large run.
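
To make the overflow concern in the description concrete, the following Python sketch shows how a reader might decode a run header and reject run lengths above a configurable bound. The default bound of 2^31 - 1 is an assumption chosen to match what the description says major implementations support; the function names and the ULEB128 interpretation of the varint are likewise assumptions, not code from any existing reader.

```python
from typing import Tuple


def varint_decode(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a ULEB128 integer from buf at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:      # continuation bit clear: last byte
            return result, pos
        shift += 7


def decode_run_header(
    buf: bytes, pos: int = 0, max_run_len: int = 2**31 - 1
) -> Tuple[bool, int, int]:
    """Return (is_bit_packed, run_len, next position); reject oversized runs.

    max_run_len is a hypothetical limit used for illustration only.
    """
    header, pos = varint_decode(buf, pos)
    is_bit_packed = bool(header & 1)
    # Bit-packed headers store the number of values divided by 8.
    run_len = (header >> 1) * 8 if is_bit_packed else header >> 1
    if run_len > max_run_len:
        raise ValueError(f"run length {run_len} exceeds limit {max_run_len}")
    return is_bit_packed, run_len, pos
```

Python integers do not overflow, so the explicit check stands in for the failure that a reader storing run lengths in a signed 32-bit integer would hit. Note that an RLE run of 2^31 - 1 values already yields a header value of 2^32 - 2, which does not fit in a signed 32-bit integer before the shift is undone.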



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
