FWIW, Impala doesn't support RLE-encoded booleans, but it seems like a reasonable extension. I'm not sure whether other readers support them in practice at the moment.
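
For concreteness, here is a rough sketch of what RLE-encoded booleans could look like at bit width 1, following the run layout described in Encodings.md (a varint header of `run_length << 1`, then the repeated value padded to a whole byte). The names `uleb128` and `rle_runs_only` are made up for illustration; this emits only RLE runs, never bit-packed runs, and leaves out the 4-byte length prefix, so treat it as an assumption-laden toy rather than what any reader or writer actually does.

```python
from itertools import groupby


def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_runs_only(values, bit_width: int) -> bytes:
    """Encode values as a sequence of RLE runs (hypothetical helper).

    Each run is: varint(run_length << 1), then the repeated value stored
    little-endian in ceil(bit_width / 8) bytes. The low header bit is 0,
    which marks the run as RLE rather than bit-packed.
    """
    value_bytes = (bit_width + 7) // 8
    out = bytearray()
    for value, group in groupby(values):
        run_length = sum(1 for _ in group)
        out += uleb128(run_length << 1)
        out += int(value).to_bytes(value_bytes, "little")
    return bytes(out)


# 100 trues followed by 3 falses collapse into two short runs.
encoded = rle_runs_only([True] * 100 + [False] * 3, bit_width=1)
print(encoded.hex())  # c801010600
```

So 103 boolean values fit in 5 bytes here, versus 13 bytes if they were packed one bit each.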

On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]> wrote:

> I think the issue is that in the library (dask/fastparquet) where this
> came up, dictionary encoding in general has not been implemented. So
> for unsigned 8-bit integers, since you can use RLE with bit width 8 to
> encode such data, this is being used as an alternative to PLAIN
> encoding. But since UINT_8 is only a logical type that annotates INT32,
> the RLE encoding as it's defined now cannot be used in general to
> encode INT32.
>
> I would suggest that we make a minor revision to the format document to
> indicate that the RLE encoding is only used for boolean values,
> dictionary indices (when using dictionary encoding, which is most of
> the time), and the repetition and definition levels.
>
> - Wes
>
> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]> wrote:
>
> > The current RLE coding has bit-packing baked into it, so I'm wondering what
> > it even means to bit-pack a lot of the types, particularly if you don't
> > have bounds on the range of values.
> >
> > I can see if you have a logical int8 column stored in an int32, you have
> > bounds on the values, so bit-packing would let you pack things more densely.
> >
> > But if you have an int64 column, do you just store the 64-bit values
> > back-to-back? Is that different from the plain encoding? Or do you select a
> > bitwidth per page and store that in the page header?
> >
> > We also can't bit-pack types like strings at all.
> >
> > I guess based on that and Ryan's observation about negative numbers, it
> > sounds like getting a quality RLE encoding for isn't a trivial extension of
> > the current encoding and needs some thought.
> >
> >
> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
> >
> >> There isn't anything that I know of that would prevent this from working. I
> >> think the Java library would even read the data successfully because it
> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
> >>
> >> The main problem with this is that the RLE encoding is unaware of negative
> >> values. Any negative number causes the entire data page to be stored with
> >> plain encoding because the most-significant bit is set. So there's just no
> >> benefit to doing it.
> >>
> >> The fact that we don't have an encoding that takes advantage of smaller
> >> widths is why I proposed a variant of the RLE codec a while back.
> >> Basically, it makes all numbers positive by zig-zag encoding (moving the
> >> sign bit to the lsb) and then allows the RLE encoding to change packing
> >> width with an extra byte. I think this would be a good one to add for v2,
> >> but this is obviously a separate issue.
> >>
> >> rb
> >>
> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
> >>
> >> > Sorry, to clarify, in this question:
> >> >
> >> >
> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >> > repetition/definition levels) ever intended for use for encoding data
> >> > pages in the Parquet V1 format?
> >> >
> >> > I meant for encoding data pages that do not contain dictionary indices
> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
> >> >
> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
> >> > > We had a discussion recently [1] in which a Python implementation of
> >> > > Parquet had used the RLE encoding type for encoding the data pages for
> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
> >> > >
> >> > > In the Encodings.md document [3] in the Parquet format, it is not
> >> > > strictly indicated that the RLE encoding is to be used for
> >> > > definition/repetition levels and boolean, though that is all that is
> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
> >> > > implementations.
> >> > >
> >> > > So questions:
> >> > >
> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >> > > repetition/definition levels) ever intended for use for encoding data
> >> > > pages in the Parquet V1 format?
> >> > >
> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
> >> > > more explicit about the purpose and scope of this encoding?
> >> > >
> >> > > Thanks,
> >> > > Wes
> >> > >
> >> > > [1]: https://github.com/dask/fastparquet/issues/256
> >> > > [2]: https://github.com/dask/fastparquet
> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
> >> >
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
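
To make Ryan's point about negative values in the quoted thread a bit more tangible, here is a small sketch of how a single negative value inflates the packing width, and how a zig-zag mapping (sign bit moved to the lsb) brings it back down. The helper names (`bit_width`, `as_uint32`, `zigzag32`, `page_bit_width`) are invented for this example, and picking one width per page is an assumption made to keep it short; it is not the variant codec Ryan proposed, which also lets the width change with an extra byte.

```python
def bit_width(value: int) -> int:
    """Bits needed to bit-pack a non-negative value (0 needs 0 bits)."""
    return value.bit_length()


def as_uint32(value: int) -> int:
    """View a signed 32-bit integer as its two's-complement bit pattern."""
    return value & 0xFFFFFFFF


def zigzag32(value: int) -> int:
    """Zig-zag map a signed 32-bit integer: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def page_bit_width(values, transform) -> int:
    """Widest value after the transform; assumes one width for the whole page."""
    return max(bit_width(transform(v)) for v in values)


values = [3, 1, -1, 2]
print(page_bit_width(values, as_uint32))  # 32
print(page_bit_width(values, zigzag32))   # 3
```

With the raw bit pattern, the single -1 sets the most-significant bit and forces 32 bits per value, which is no better than plain encoding. After zig-zag, the same page packs in 3 bits per value.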
