jorgecarleitao commented on a change in pull request #170:
URL: https://github.com/apache/parquet-format/pull/170#discussion_r606749884



##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, 
[RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both 
encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent 
definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this 
encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses 
[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to 
[RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific 
`bit_width`
+indicating the number of bits necessary to represent the largest encoded 
integer in the stream.
+
+The `bit_width` MUST be sufficient to represent the largest encoded integer on 
the
+stream or the result is undefined.
+
+The first 4 bytes of the stream MUST represent a little-endian unsigned 
integer (`uint32`)
+denoting the length of the rest of the stream. For example, `[4u8, 0, 0, 0]` 
announces
+that the stream has a total of `4 + 4 = 8` bytes. The first 4 bytes are only 
used for this purpose.
+
+The remaining bytes are divided in "runs", which MUST be either RLE-encoded or 
bitpacked-encoded.
+Each "run" MUST be composed by a header of a variable number of bytes and by a 
body in sequence.
+I.e. `[h11, h12, h13, b11, b12, ...]` where `h11` is the first byte of the 
header
+of run `1`, `h12` the second byte of the header, `b11` is the first byte of 
the body of the first run.
+
+The header MUST be a single 
[ULEB128](https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)-encoded `i32`,
+here denoted as `h1`. The bytes needed to decode the ULEB128-encoded 
constitute the header
+and the remaining bytes constitute the body.
+
+The first bit of the last byte of the header denotes whether the run is 
bitpacked-encoded
+or RLE-encoded. `h1 & 1 == 1` indicates a bitpacked-encoded run, `h1 & 1 != 1` 
a RLE-encoded run.
+
+### Decoding RLE-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of 
repetitions
+is given by `repetitions = h1 >> 1`. The number of bytes in the body, 
`body_length`,
+MUST be the minimum number of bytes that can hold `bit_width`, `body_length = 
ceil(bit_width, 8)`.
+The body MUST represent the repeated value in little endian for byte types 
(e.g. `int32`) and
+[LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit) for 
boolean types.
+
+### Decoding bitpacked-encoded runs
+
+Given a header `h1` and the stream of bytes past the header, the number of 
bytes
+in the body, `body_length`, is equal to `body_length = h1 >> 1`.
+The body represents bitpacked-encoded values with exactly `8 / bit_width` 
items.
+
+Note that for `bit_width = 1`, this encoding corresponds exactly to LSB.
+Thus, in-memory formats that store booleans or validities in `LSB` (e.g. 
Apache Arrow),
+the bodies can be memcopied _as is_ to a LSB buffer.
+
+### Example
+
+Let's now consider the the following byte stream and a `bit_width = 1`,
+
+```
+[
+    5, 0, 0, 0,
+    0b00000101, 0b11101011, 0b00000010,
+    0b00010000, 0b00000001,
+    0b00000101,
+    0b00000101,
+]
+```
+
+which could appear in a definition level of a non-nested Parquet type.
+
+We use the first 4 bytes to identify relevant length, within the stream,
+of the encoded bytes. This corresponds to `[5, 0, 0, 0]`, which, in little
+endian, corresponds to `5u32`. We can thus assume that all runs of this encoded
+stream are encoded in 

Review comment:
       was just to allude that a decoder may receive a stream of bytes larger 
than what it should decode. Page buffers in parquet are organized as `<encoded 
def levels> <encoded rep levels> <data>`. When decoding a page, the 
RLE-bitpacked encoder must not cross beyond its own boundaries.
   
   Not so relevant though, we can remove if it adds complexity.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to