I think the issue is that in the library (dask/fastparquet) where this came up, dictionary encoding in general has not been implemented. So for unsigned 8-bit integers, since you can use RLE with bit width 8 to encode such data, this is being used as an alternative to PLAIN encoding. But since UINT_8 is only a logical type that annotates INT32, the RLE encoding as it's defined now cannot be used in general to encode INT32.
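To make the bit-width point concrete, here is a minimal Python sketch. It is not fastparquet's code; the helper name bit_width_as_unsigned32 is hypothetical, and it assumes the packer treats each INT32 as its raw 32-bit pattern. It shows that UINT_8 data fits in 8 bits, while a single negative INT32 forces the full 32.

```python
def bit_width_as_unsigned32(values):
    """Smallest bit width that holds every value's 32-bit two's-complement pattern.

    Hypothetical helper for illustration only.
    """
    widest = 0
    for v in values:
        pattern = v & 0xFFFFFFFF  # reinterpret the INT32 as unsigned bits
        widest = max(widest, pattern.bit_length())
    return widest


uint8_column = [0, 17, 200, 255]  # UINT_8 values physically stored as INT32
int32_column = [0, 17, -1, 255]   # same idea, but with one negative value

print(bit_width_as_unsigned32(uint8_column))  # 8
print(bit_width_as_unsigned32(int32_column))  # 32
```

With the second column, bit-packing buys nothing over PLAIN, which is why the encoding cannot serve INT32 in general.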
I would suggest that we make a minor revision to the format document to indicate that the RLE encoding is only used for boolean values, dictionary indices (when using dictionary encoding, which is most of the time), and the repetition and definition levels.

- Wes

On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]> wrote:
> The current RLE coding has bit-packing baked into it, so I'm wondering what
> it even means to bit-pack a lot of the types, particularly if you don't
> have bounds on the range of values.
>
> I can see if you have a logical int8 column stored in an int32, you have
> bounds on the values, so bit-packing would let you pack things more densely
>
> But if you have an int64 column, do you just store the 64 bit values
> back-to-back? Is that different from the plain encoding? Or do you select a
> bitwidth per page and store that in the page header?
>
> We also can't bit-pack types like strings at all.
>
> I guess based on that and Ryan's observation about negative numbers, it
> sounds like getting a quality RLE encoding isn't a trivial extension of
> the current encoding and needs some thought.
>
>
> On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
>
>> There isn't anything that I know of that would prevent this from working. I
>> think the Java library would even read the data successfully because it
>> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>>
>> The main problem with this is that the RLE encoding is unaware of negative
>> values. Any negative number causes the entire data page to be stored with
>> plain encoding because the most-significant bit is set. So there's just no
>> benefit to doing it.
>>
>> The fact that we don't have an encoding that takes advantage of smaller
>> widths is why I proposed a variant of the RLE codec a while back.
>> Basically, it makes all numbers positive by zig-zag encoding (moving the
>> sign bit to the lsb) and then allows the RLE encoding to change packing
>> width with an extra byte. I think this would be a good one to add for v2,
>> but this is obviously a separate issue.
>>
>> rb
>>
>> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
>>
>> > Sorry, to clarify, in this question:
>> >
>> >
>> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>> > repetition/definition levels) ever intended for use for encoding data
>> > pages in the Parquet V1 format?
>> >
>> > I meant for encoding data pages that do not contain dictionary indices
>> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>> >
>> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
>> > > We had a discussion recently [1] in which a Python implementation of
>> > > Parquet had used the RLE encoding type for encoding the data pages for
>> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>> > >
>> > > In the Encodings.md document [3] in the Parquet format, it is not
>> > > strictly indicated that the RLE encoding is to be used for
>> > > definition/repetition levels and boolean, though that is all that is
>> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>> > > implementations.
>> > >
>> > > So questions:
>> > >
>> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>> > > repetition/definition levels) ever intended for use for encoding data
>> > > pages in the Parquet V1 format?
>> > >
>> > > 2) Whether yes or no, should we update apache/parquet-format to be
>> > > more explicit about the purpose and scope of this encoding?
>> > >
>> > > Thanks,
>> > > Wes
>> > >
>> > > [1]: https://github.com/dask/fastparquet/issues/256
>> > > [2]: https://github.com/dask/fastparquet
>> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
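For reference, the zig-zag step Ryan mentions in the quoted thread ("moving the sign bit to the lsb") can be sketched in Python as below. This is only an illustration of the standard 32-bit zig-zag mapping, not Ryan's proposed codec: the function names are hypothetical, and the extra byte that changes the packing width is not modelled.

```python
MASK32 = 0xFFFFFFFF


def zigzag_encode32(n):
    """Map a signed 32-bit int to unsigned so small magnitudes stay small."""
    # The arithmetic shift n >> 31 is 0 for non-negative n and -1 for negative n.
    return ((n << 1) ^ (n >> 31)) & MASK32


def zigzag_decode32(z):
    """Inverse of zigzag_encode32."""
    return (z >> 1) ^ -(z & 1)


values = [0, 17, -1, 255]
encoded = [zigzag_encode32(v) for v in values]
width = max(z.bit_length() for z in encoded)

print(encoded)  # [0, 34, 1, 510]
print(width)    # 9
assert [zigzag_decode32(z) for z in encoded] == values
```

After the mapping, the column with a negative value needs 9 bits per value instead of 32, which is the benefit the plain RLE/bit-packing hybrid cannot offer for signed data.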
