[ 
https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466165#comment-16466165
 ] 

ASF GitHub Bot commented on PARQUET-1290:
-----------------------------------------

rdblue closed pull request #96: PARQUET-1290: clarify run lengths for RLE encoding
URL: https://github.com/apache/parquet-format/pull/96
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/Encodings.md b/Encodings.md
index f3b8d50b..9358b137 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -72,12 +72,14 @@ length := length of the <encoded-data> in bytes stored as 4 bytes little endian
 encoded-data := <run>*
 run := <bit-packed-run> | <rle-run>
 bit-packed-run := <bit-packed-header> <bit-packed-values>
-bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)
+bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)
 // we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
-bit-pack-count := (number of values in this run) / 8
+bit-pack-scaled-run-len := (bit-packed-run-len) / 8
+bit-packed-run-len := *see 3 below*
 bit-packed-values := *see 1 below*
 rle-run := <rle-header> <repeated-value>
-rle-header := varint-encode( (number of times repeated) << 1)
+rle-header := varint-encode( (rle-run-len) << 1)
+rle-run-len := *see 3 below*
 repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
 ```
 
@@ -107,6 +109,13 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
 
 2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128
 
+3. bit-packed-run-len and rle-run-len must be in the range \[1, 2<sup>31</sup> - 1\].
+   This means that a Parquet implementation can always store the run length in a signed
+   32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier
+   specifications, but longer runs were not readable by the most common Parquet
+   implementations so, in practice, were not safe for Parquet writers to emit.
+
+
 Note that the RLE encoding method is only supported for the following types of
 data:
 


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
>                 Key: PARQUET-1290
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>
> The Parquet spec isn't clear about the upper bound on run lengths in the 
> RLE encoding: 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> It sounds like, in practice, the major implementations don't support run 
> lengths > (2^31 - 1); see 
> https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} 
> to being <= 2^31.
> It seems unlikely that there are parquet files in existence with larger run 
> lengths, given that it requires huge numbers of values per page and major 
> implementations can't write or read such files without overflowing integers. 
> Maybe it would be possible if all the columns in a file were extremely 
> compressible, but it seems like in practice most implementations will hit 
> page or file size limits before producing a very-large run.
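
To make the header grammar and the run-length limit from the diff above concrete, here is a minimal Python sketch. It is not taken from any Parquet implementation; the names `MAX_RUN_LEN`, `uleb128_encode`, `uleb128_decode`, `check_run_len`, `rle_header`, `bit_packed_header`, and `decode_header` are hypothetical, chosen only for illustration. It assumes the limit is the range [1, 2^31 - 1] as stated in note 3 of the merged change, and it covers the run headers only, not the bit-packed values or the repeated value that follow them.

```python
# Illustrative sketch only: all names below are hypothetical, not a real Parquet API.
MAX_RUN_LEN = 2**31 - 1  # upper bound from note 3 of the diff


def uleb128_encode(n):
    """Encode a non-negative integer as ULEB-128 (the spec's varint-encode)."""
    if n < 0:
        raise ValueError("ULEB-128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def uleb128_decode(buf, pos=0):
    """Decode one ULEB-128 integer; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def check_run_len(run_len):
    """Reject run lengths outside [1, 2^31 - 1]."""
    if not 1 <= run_len <= MAX_RUN_LEN:
        raise ValueError("run length %d outside [1, 2^31 - 1]" % run_len)


def rle_header(run_len):
    """rle-header := varint-encode(rle-run-len << 1)"""
    check_run_len(run_len)
    return uleb128_encode(run_len << 1)


def bit_packed_header(run_len):
    """bit-packed-header := varint-encode((bit-packed-run-len / 8) << 1 | 1)"""
    check_run_len(run_len)
    if run_len % 8:
        raise ValueError("bit-packed runs hold a multiple of 8 values")
    return uleb128_encode(((run_len // 8) << 1) | 1)


def decode_header(buf, pos=0):
    """Return (kind, run_len, next_position) for the run header at buf[pos]."""
    header, pos = uleb128_decode(buf, pos)
    if header & 1:  # low bit set: bit-packed run, length stored divided by 8
        kind, run_len = "bit-packed", (header >> 1) * 8
    else:  # low bit clear: RLE run
        kind, run_len = "rle", header >> 1
    check_run_len(run_len)
    return kind, run_len, pos
```

For example, `rle_header(1)` yields `b"\x02"` and `bit_packed_header(8)` yields `b"\x03"`. The largest permitted RLE run, `rle_header(MAX_RUN_LEN)`, needs a 5-byte header, so the header itself stays tiny even at the limit. A run that long is impractical because of the number of values the page would have to hold, which is the point made in the issue description.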



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
