The current RLE coding has bit-packing baked into it, so I'm wondering what it even means to bit-pack a lot of the types, particularly if you don't have bounds on the range of values.
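
To make the question concrete, here is a rough Python sketch of what picking a bit width from the data might look like. The helper names `min_bit_width` and `bit_pack` are made up for illustration and are not part of any Parquet library, and the least-significant-bit-first packing order is an assumption of the sketch.

```python
def min_bit_width(values):
    """Smallest width (in bits) that holds every value in the page.

    Assumes non-negative values; a negative value has no natural
    fixed width without picking a signed representation first.
    """
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative values")
    return max((v.bit_length() for v in values), default=0)


def bit_pack(values, width):
    """Pack values back-to-back at `width` bits each, LSB first (assumed order)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8  # round up to whole bytes
    return acc.to_bytes(nbytes, "little")


page = [3, 0, 7, 1]            # e.g. small values held in a wider physical type
width = min_bit_width(page)    # derived by scanning the page
packed = bit_pack(page, width)
```

With a width of 64, `bit_pack` produces the same bytes as writing each value as a little-endian 64-bit integer back-to-back, so in this sketch there is only a gain when the values in a page happen to be small.
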
I can see that if you have a logical int8 column stored in an int32, you have bounds on the values, so bit-packing would let you pack things more densely. But if you have an int64 column, do you just store the 64-bit values back-to-back? Is that different from the plain encoding? Or do you select a bit width per page and store that in the page header? We also can't bit-pack types like strings at all.

I guess based on that and Ryan's observation about negative numbers, it sounds like getting a quality RLE encoding isn't a trivial extension of the current encoding and needs some thought.

On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue wrote:

> There isn't anything that I know of that would prevent this from working. I
> think the Java library would even read the data successfully because it
> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>
> The main problem with this is that the RLE encoding is unaware of negative
> values. Any negative number causes the entire data page to be stored with
> plain encoding because the most-significant bit is set. So there's just no
> benefit to doing it.
>
> The fact that we don't have an encoding that takes advantage of smaller
> widths is why I proposed a variant of the RLE codec a while back.
> Basically, it makes all numbers positive by zig-zag encoding (moving the
> sign bit to the lsb) and then allows the RLE encoding to change packing
> width with an extra byte. I think this would be a good one to add for v2,
> but this is obviously a separate issue.
>
> rb
>
> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney wrote:
>
> > Sorry, to clarify, in this question:
> >
> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> > repetition/definition levels) ever intended for use for encoding data
> > pages in the Parquet V1 format?
> >
> > I meant for encoding data pages that do not contain dictionary indices
> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
> >
> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney wrote:
> > > We had a discussion recently [1] in which a Python implementation of
> > > Parquet had used the RLE encoding type for encoding the data pages for
> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
> > >
> > > In the Encodings.md document [3] in the Parquet format, it is not
> > > strictly indicated that the RLE encoding is to be used for
> > > definition/repetition levels and boolean, though that is all that is
> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
> > > implementations.
> > >
> > > So questions:
> > >
> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> > > repetition/definition levels) ever intended for use for encoding data
> > > pages in the Parquet V1 format?
> > >
> > > 2) Whether yes or no, should we update apache/parquet-format to be
> > > more explicit about the purpose and scope of this encoding?
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://github.com/dask/fastparquet/issues/256
> > > [2]: https://github.com/dask/fastparquet
> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>
> --
> Ryan Blue
> Software Engineer
> Netflix
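
For reference, the zig-zag step Ryan mentions in the quoted message can be sketched in a few lines of Python. This is only an illustration of the general zig-zag mapping, not the codec he proposed; the function names and the 64-bit default are assumptions made for the example.

```python
def zigzag_encode(n, bits=64):
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3.

    The sign ends up in the least-significant bit, so values of small
    magnitude stay small regardless of sign.
    """
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


page = [-3, 0, 2, -1]
encoded = [zigzag_encode(v) for v in page]
# Width needed to bit-pack the page after zig-zag encoding.
width = max(z.bit_length() for z in encoded)
```

In two's complement the negative values in `page` have the most-significant bit set and so need the full physical width, while the zig-zag values here all fit in a few bits, which is what would let a bit-packed run use a narrow width.
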
