Re: Clarifying valid uses for RLE encoding type

Tim Armstrong Thu, 07 Dec 2017 11:55:32 -0800

FWIW Impala doesn't support RLE-encoded booleans but it seems like a
reasonable extension. I'm not sure if other readers support that too in
practice at the moment.


On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]> wrote:

> I think the issue is that in the library (dask/fastparquet) where this
> came up, dictionary encoding in general has not been implemented. So
> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
> encode such data, this is being used as an alternative to PLAIN
> encoding. But since UINT_8 is only a logical type that annotates INT32,
> the RLE encoding as it's defined now cannot be used in general to
> encode INT32.
>
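For readers following along, here is a minimal sketch of what "RLE with bit width 8" looks like on the wire, assuming the run layout described in Encodings.md (a varint header of run length shifted left by one, followed by the repeated value in the bit width rounded up to whole bytes). The function names are hypothetical, and the sketch emits only RLE runs: it leaves out bit-packed runs and any length prefix a page would carry.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_encode(values, bit_width: int) -> bytes:
    """Encode values as a sequence of RLE runs (hypothetical helper)."""
    value_bytes = (bit_width + 7) // 8  # round the bit width up to whole bytes
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += encode_varint((j - i) << 1)  # low bit 0 marks an RLE run
        # to_bytes raises OverflowError for negative or too-large values
        out += values[i].to_bytes(value_bytes, "little")
        i = j
    return bytes(out)


def rle_decode(data: bytes, bit_width: int) -> list:
    """Decode a buffer that contains only RLE runs (hypothetical helper)."""
    value_bytes = (bit_width + 7) // 8
    values, pos = [], 0
    while pos < len(data):
        header, shift = 0, 0
        while True:  # read the varint header
            byte = data[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:
            raise NotImplementedError("bit-packed runs are not handled here")
        value = int.from_bytes(data[pos:pos + value_bytes], "little")
        pos += value_bytes
        values.extend([value] * (header >> 1))
    return values
```

With bit width 8 every repeated value occupies exactly one byte, which is why this works for UINT_8 data but does not extend to arbitrary INT32 values without a wider (or per-page) bit width.
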
> I would suggest that we make a minor revision to the format document to
> indicate that the RLE encoding is only used for boolean values,
> dictionary indices (when using dictionary encoding, which is most of
> the time), and the repetition and definition levels.
>
> - Wes
>
> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]> wrote:
> > The current RLE coding has bit-packing baked into it, so I'm wondering what
> > it even means to bit-pack a lot of the types, particularly if you don't
> > have bounds on the range of values.
> >
> > I can see if you have a logical int8 column stored in an int32, you have
> > bounds on the values, so bit-packing would let you pack things more densely.
> >
> > But if you have an int64 column, do you just store the 64-bit values
> > back-to-back? Is that different from the plain encoding? Or do you select a
> > bitwidth per page and store that in the page header?
> >
> > We also can't bit-pack types like strings at all.
> >
> > I guess based on that and Ryan's observation about negative numbers, it
> > sounds like getting a quality RLE encoding isn't a trivial extension of
> > the current encoding and needs some thought.
> >
> >
> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
> >
> >> There isn't anything that I know of that would prevent this from working. I
> >> think the Java library would even read the data successfully because it
> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
> >>
> >> The main problem with this is that the RLE encoding is unaware of negative
> >> values. Any negative number causes the entire data page to be stored with
> >> plain encoding because the most-significant bit is set. So there's just no
> >> benefit to doing it.
> >>
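As an illustration of the point about negative values, the following sketch computes the bit width a page would need if every value is read as an unsigned two's-complement integer. The helper name and the `type_bits` parameter are hypothetical and not taken from any Parquet library.

```python
def required_bit_width(values, type_bits: int = 32) -> int:
    """Bits needed for the largest value when read as unsigned two's complement."""
    mask = (1 << type_bits) - 1
    # Masking turns a negative value into its unsigned bit pattern,
    # which always has the most-significant bit set.
    return max(((v & mask).bit_length() for v in values), default=0)


small = required_bit_width([0, 1, 200])        # fits in 8 bits
with_negative = required_bit_width([0, 1, -1]) # needs all 32 bits
```

A single negative entry pushes the width to the full 32 bits, at which point packing saves nothing over storing the plain 4-byte values.
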
> >> The fact that we don't have an encoding that takes advantage of smaller
> >> widths is why I proposed a variant of the RLE codec a while back.
> >> Basically, it makes all numbers positive by zig-zag encoding (moving the
> >> sign bit to the lsb) and then allows the RLE encoding to change packing
> >> width with an extra byte. I think this would be a good one to add for v2,
> >> but this is obviously a separate issue.
> >>
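The zig-zag step mentioned above can be sketched as follows. This covers only the sign-folding transform, not the proposed width-changing byte, and the function names are hypothetical.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map a signed integer to unsigned so small magnitudes stay small."""
    mask = (1 << bits) - 1
    # The arithmetic shift yields 0 for non-negative n and -1 (all ones) for
    # negative n, so the XOR moves the sign into the least-significant bit.
    return ((n << 1) ^ (n >> (bits - 1))) & mask


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


# 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4
encoded = [zigzag_encode(v) for v in (0, -1, 1, -2, 2)]
```

After this transform, a page holding small negative and positive values needs only a few bits per value rather than the full type width.
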
> >> rb
> >>
> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
> >>
> >> > Sorry, to clarify, in this question:
> >> >
> >> >
> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >> > repetition/definition levels) ever intended for use for encoding data
> >> > pages in the Parquet V1 format?
> >> >
> >> > I meant for encoding data pages that do not contain dictionary indices
> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
> >> >
> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
> >> > > We had a discussion recently [1] in which a Python implementation of
> >> > > Parquet had used the RLE encoding type for encoding the data pages for
> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
> >> > >
> >> > > In the Encodings.md document [3] in the Parquet format, it is not
> >> > > strictly indicated that the RLE encoding is to be used only for
> >> > > definition/repetition levels and booleans, though that is all that is
> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
> >> > > implementations.
> >> > >
> >> > > So questions:
> >> > >
> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >> > > repetition/definition levels) ever intended for use for encoding data
> >> > > pages in the Parquet V1 format?
> >> > >
> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
> >> > > more explicit about the purpose and scope of this encoding?
> >> > >
> >> > > Thanks,
> >> > > Wes
> >> > >
> >> > > [1]: https://github.com/dask/fastparquet/issues/256
> >> > > [2]: https://github.com/dask/fastparquet
> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
> >> >
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
>
