jorgecarleitao commented on a change in pull request #170:
URL: https://github.com/apache/parquet-format/pull/170#discussion_r606702703
##########
File path: rle-bitpacked.md
##########
@@ -0,0 +1,120 @@
+# RLE-Bitpacked hybrid encoder
+
+The RLE-Bitpacked hybrid encoder is a parquet-specific encoder that combines
+two well known encoding strategies, [RLE](https://en.wikipedia.org/wiki/Run-length_encoding)
+and bitpacking. Note that "combine" here means this encoder allows both encodings
+within the same stream, and, during encoding, it can switch between them.
+
+This encoder is only used to encode integer values that may either represent definition levels,
+representation levels or ids of dictionary-encoded pages. Note that this encoder
+supports integers that can be represented in less than 8 bits.
+
+This document uses [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_bit)
+to identify bits. In this representation, a byte is represented
+by `[b7 b6 b5 b4 b3 b2 b1 b0]` where `b0` is the first bit.
+
+This document uses MUST, SHOULD, etc. according to [RFC-8174](https://tools.ietf.org/html/rfc8174).
+
+## Decoding
+
+Decoding a stream of bytes (denoted as `[a1, a2, a3, ...]`) assumes a specific `bit_width`
+indicating the number of bits necessary to represent the largest encoded integer in the stream.

Review comment:
The `bit_width` is pre-arranged by other means. The decoder itself has no way of extracting the bit width from the stream. For dictionary-encoded buffers, the bit width of the indexes in the data page is provided in the first byte of the whole data buffer (after rep and def levels). For rep and def levels, the bit width is computed as `ceil(log2(max_level + 1))`. I.e., AFAIK the decoder is given the bit width before it reads the stream, and the parquet format has different ways of encapsulating that information.
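
To make the two cases concrete, here is a minimal Python sketch of how a reader might obtain the bit width before handing bytes to the hybrid decoder. The function names `level_bit_width` and `split_dictionary_indices` are hypothetical and do not come from any parquet library, and the second helper assumes the buffer it receives already starts right after the rep and def levels. Note that levels range over `0..max_level`, i.e. `max_level + 1` distinct values, so a `max_level` of 1 needs 1 bit.

```python
from typing import Tuple


def level_bit_width(max_level: int) -> int:
    """Bits needed to represent every level in 0..max_level (inclusive)."""
    if max_level < 0:
        raise ValueError("max_level must be non-negative")
    # For non-negative ints, bit_length() equals ceil(log2(max_level + 1)).
    return max_level.bit_length()


def split_dictionary_indices(buffer: bytes) -> Tuple[int, bytes]:
    """Split a dictionary-index buffer into (bit_width, hybrid-encoded bytes).

    Assumption: `buffer` begins at the first byte after the rep/def levels,
    so its first byte carries the bit width of the indexes.
    """
    if not buffer:
        raise ValueError("buffer is empty; no bit width byte to read")
    return buffer[0], buffer[1:]
```

Either way, the bit width is an input to the decoder rather than something it discovers while reading the stream.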
