Re: Clarifying valid uses for RLE encoding type

Ryan Blue Wed, 13 Dec 2017 10:46:29 -0800

Sounds good to me.

On Wed, Dec 13, 2017 at 10:38 AM, Wes McKinney <[email protected]> wrote:


> Alright, so what I've taken from this thread is that we should make a
> small amendment to the format documents to indicate valid uses for the
> RLE encoding -- at least as far as Parquet format V1 is concerned.
> These consist of:
>
> * Repetition and definition levels
> * Indices in dictionary pages
> * Boolean values
>
> This would help resolve the confusion that we experienced in
> https://github.com/dask/fastparquet/issues/256. Does that sound right?
>
> Thanks
> Wes
>
> On Mon, Dec 11, 2017 at 12:52 PM, Ryan Blue <[email protected]> wrote:
> >> to do this properly, the writer should be able to specify the bitwidth
> >> explicitly per-page
> >
> > In the encoders I proposed to do this, I added a byte to bit-packed runs
> > that encodes the width:
> > https://github.com/rdblue/parquet-mr/blob/encoders/
> parquet-column/src/main/java/org/apache/parquet/column/values/zigzag/
> VariableWidthRLEEncoder.java#L197
> >
> > Adding a byte also allows the width to change within a page. It isn't as
> > effective as patching for large values, but there are a couple benefits.
> > First, you don't use the largest width for all values if you have just
> one
> > large one. Second, you don't have to keep an entire page worth of values
> in
> > memory before encoding because you can widen the bit packed runs and
> write
> > incrementally.
> >
> > rb
> >
> > On Fri, Dec 8, 2017 at 11:40 AM, Tim Armstrong <[email protected]>
> > wrote:
> >>
> >> Ah I see - that's definitely not part of the format then, since it
> >> requires the reader and writer to agree on the algorithm for deciding
> >> bitwidth, and there's no mention of that in the format. It seems like
> to do
> >> this properly, the writer should be able to specify the bitwidth
> explicitly
> >> per-page to also handle cases where the actual values encoded do not
> need
> >> the full bitwidth implied by the type.
> >>
> >> On Fri, Dec 8, 2017 at 11:26 AM, Wes McKinney <[email protected]>
> wrote:
> >>>
> >>> In the case where this arose, the developer had used the UINT_8
> >>> ConvertedType to imply a bit width of 8.
> >>>
> >>> On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]>
> >>> wrote:
> >>> > Good point. For Parquet Java this is always passed in. I guess this
> is
> >>> > using the type's maximum width? If so, I don't think this would be
> >>> > readable
> >>> > by other Parquet implementations because there is no place to store
> the
> >>> > bit
> >>> > width.
> >>> >
> >>> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <
> [email protected]>
> >>> > wrote:
> >>> >
> >>> >> > Using the RLE encoding will be different from the plain encoding
> >>> >> > because
> >>> >> you'd have the overhead bytes for runs and packed sections. We would
> >>> >> still
> >>> >> pack int64 values using the width, which is a required parameter.
> >>> >> How would a reader determine the bit width though? I can't see
> >>> >> anywhere in
> >>> >> the format where the bit width is explicitly set. For the RLE level
> >>> >> decoding it's implied by the max rep/def level.
> >>> >>
> >>> >> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]>
> wrote:
> >>> >>
> >>> >>> > But if you have a int64 column, do you just store the 64 bit
> values
> >>> >>> back-to-back? Is that different from the plain encoding?
> >>> >>>
> >>> >>> Using the RLE encoding will be different from the plan encoding
> >>> >>> because
> >>> >>> you'd have the overhead bytes for runs and packed sections. We
> would
> >>> >>> still
> >>> >>> pack int64 values using the width, which is a required parameter.
> >>> >>>
> >>> >>> > I would suggest that we make a minor revision the format document
> >>> >>> > to
> >>> >>> indicate that the RLE encoding is only used for boolean values,
> >>> >>> dictionary
> >>> >>> indices (when using dictionary encoding, which is most of the
> time),
> >>> >>> and
> >>> >>> the repetition and definition levels.
> >>> >>>
> >>> >>> Unsigned, small integers are actually a good case for using RLE
> >>> >>> codecs.
> >>> >>> If you can guarantee that you won't have the msb set unless the
> >>> >>> number
> >>> >>> really is large, then why not allow people to use them?
> >>> >>>
> >>> >>> rb
> >>> >>>
> >>> >>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong
> >>> >>> <[email protected]>
> >>> >>> wrote:
> >>> >>>
> >>> >>>> FWIW Impala doesn't support RLE-encoded booleans but it seems
> like a
> >>> >>>> reasonable extension. I'm not sure if other readers support that
> too
> >>> >>>> in
> >>> >>>> practice at the moment.
> >>> >>>>
> >>> >>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]
> >
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>>> I think the issue is that in the library (dask/fastparquet) where
> >>> >>>>> this
> >>> >>>>> came up, dictionary encoding in general has not been implemented.
> >>> >>>>> So
> >>> >>>>> for unsigned 8-bit integer, since you can use RLE with bit width
> 8
> >>> >>>>> to
> >>> >>>>> encode such data, this is being used as an alternative to PLAIN
> >>> >>>>> encoding. But since UINT_8 is only a logical type the annotates
> >>> >>>>> INT32,
> >>> >>>>> the RLE encoding as it's defined now cannot be used in general to
> >>> >>>>> encode INT32.
> >>> >>>>>
> >>> >>>>> I would suggest that we make a minor revision the format document
> >>> >>>>> to
> >>> >>>>> indicate that the RLE encoding is only used for boolean values,
> >>> >>>>> dictionary indices (when using dictionary encoding, which is most
> >>> >>>>> of
> >>> >>>>> the time), and the repetition and definition levels.
> >>> >>>>>
> >>> >>>>> - Wes
> >>> >>>>>
> >>> >>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong
> >>> >>>>> <[email protected]>
> >>> >>>>> wrote:
> >>> >>>>> > The current RLE coding has bit-packing baked into it, so I'm
> >>> >>>>> wondering what
> >>> >>>>> > it even means to bit-pack a lot of the types, particularly if
> you
> >>> >>>>> don't
> >>> >>>>> > have bounds on the range of values.
> >>> >>>>> >
> >>> >>>>> > I can see if you have a logic int8 column stored in an int32,
> you
> >>> >>>>> > have
> >>> >>>>> > bounds on the values, so bit-packing would let you pack things
> >>> >>>>> > more
> >>> >>>>> densely
> >>> >>>>> >
> >>> >>>>> > But if you have a int64 column, do you just store the 64 bit
> >>> >>>>> > values
> >>> >>>>> > back-to-back? Is that different from the plain encoding? Or do
> >>> >>>>> > you
> >>> >>>>> select a
> >>> >>>>> > bitwidth per page and store that in the page header?
> >>> >>>>> >
> >>> >>>>> > We also can't bit-pack types like strings at all.
> >>> >>>>> >
> >>> >>>>> > I guess based on that and Ryan's observation about negative
> >>> >>>>> > numbers,
> >>> >>>>> it
> >>> >>>>> > sounds like getting a quality RLE encoding for isn't a trivial
> >>> >>>>> extension of
> >>> >>>>> > the current encoding and needs some thought.
> >>> >>>>> >
> >>> >>>>> >
> >>> >>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue
> >>> >>>>> > <[email protected]>
> >>> >>>>> wrote:
> >>> >>>>> >
> >>> >>>>> >> There isn't anything that I know of that would prevent this
> from
> >>> >>>>> working. I
> >>> >>>>> >> think the Java library would even read the data successfully
> >>> >>>>> >> because
> >>> >>>>> it
> >>> >>>>> >> allows pages (usually dictionary-encoded ones) to be RLE
> >>> >>>>> >> encoded.
> >>> >>>>> >>
> >>> >>>>> >> The main problem with this is that the RLE encoding is unaware
> >>> >>>>> >> of
> >>> >>>>> negative
> >>> >>>>> >> values. Any negative number causes the entire data page to be
> >>> >>>>> >> stored
> >>> >>>>> with
> >>> >>>>> >> plain encoding because the most-significant bit is set. So
> >>> >>>>> >> there's
> >>> >>>>> just no
> >>> >>>>> >> benefit to doing it.
> >>> >>>>> >>
> >>> >>>>> >> The fact that we don't have an encoding that takes advantage
> of
> >>> >>>>> smaller
> >>> >>>>> >> widths is why I proposed a variant of the RLE codec a while
> >>> >>>>> >> back.
> >>> >>>>> >> Basically, it makes all numbers positive by zig-zag encoding
> >>> >>>>> >> (moving
> >>> >>>>> the
> >>> >>>>> >> sign bit to the lsb) and then allows the RLE encoding to
> change
> >>> >>>>> packing
> >>> >>>>> >> width with an extra byte. I think this would be a good one to
> >>> >>>>> >> add
> >>> >>>>> for v2,
> >>> >>>>> >> but this is obviously a separate issue.
> >>> >>>>> >>
> >>> >>>>> >> rb
> >>> >>>>> >>
> >>> >>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney
> >>> >>>>> >> <[email protected]>
> >>> >>>>> wrote:
> >>> >>>>> >>
> >>> >>>>> >> > Sorry, to clarify, in this question:
> >>> >>>>> >> >
> >>> >>>>> >> >
> >>> >>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >>> >>>>> >> > repetition/definition levels) ever intended for use for
> >>> >>>>> >> > encoding
> >>> >>>>> data
> >>> >>>>> >> > pages in the Parquet V1 format?
> >>> >>>>> >> >
> >>> >>>>> >> > I meant for encoding data pages that do not contain
> dictionary
> >>> >>>>> indices
> >>> >>>>> >> > (i.e. as an alternative to PLAIN or
> >>> >>>>> >> > PLAIN_DICTIONARY/RLE_DICTIONAR
> >>> >>>>> Y)
> >>> >>>>> >> >
> >>> >>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney
> >>> >>>>> >> > <[email protected]>
> >>> >>>>> >> wrote:
> >>> >>>>> >> > > We had a discussion recently [1] in which a Python
> >>> >>>>> implementation of
> >>> >>>>> >> > > Parquet had used the RLE encoding type for encoding the
> data
> >>> >>>>> pages for
> >>> >>>>> >> > > INT32 values with UINT_8 logical type (non
> >>> >>>>> >> > > dictionary-encoded).
> >>> >>>>> >> > >
> >>> >>>>> >> > > In the Encodings.md document [3] in the Parquet format, it
> >>> >>>>> >> > > is not
> >>> >>>>> >> > > strictly indicated that the RLE encoding is to be used for
> >>> >>>>> >> > > definition/repetition levels and boolean, though that is
> all
> >>> >>>>> that is
> >>> >>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and
> >>> >>>>> >> > > other
> >>> >>>>> >> > > implementations.
> >>> >>>>> >> > >
> >>> >>>>> >> > > So questions:
> >>> >>>>> >> > >
> >>> >>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >>> >>>>> >> > > repetition/definition levels) ever intended for use for
> >>> >>>>> >> > > encoding
> >>> >>>>> data
> >>> >>>>> >> > > pages in the Parquet V1 format?
> >>> >>>>> >> > >
> >>> >>>>> >> > > 2) Whether yes or no, should we update
> apache/parquet-format
> >>> >>>>> >> > > to
> >>> >>>>> be
> >>> >>>>> >> > > more explicit about the purpose and scope of this
> encoding?
> >>> >>>>> >> > >
> >>> >>>>> >> > > Thanks,
> >>> >>>>> >> > > Wes
> >>> >>>>> >> > >
> >>> >>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
> >>> >>>>> >> > > [2]: https://github.com/dask/fastparquet
> >>> >>>>> >> > > [3]: https://github.com/apache/parq
> >>> >>>>> uet-format/blob/master/Encodings.md
> >>> >>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/
> >>> >>>>> >> > parquet-column/src/main/java/org/apache/parquet/column/
> >>> >>>>> >> Encoding.java#L115
> >>> >>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/
> >>> >>>>> >> > exec/parquet-column-readers.cc#L495
> >>> >>>>> >> >
> >>> >>>>> >>
> >>> >>>>> >>
> >>> >>>>> >>
> >>> >>>>> >> --
> >>> >>>>> >> Ryan Blue
> >>> >>>>> >> Software Engineer
> >>> >>>>> >> Netflix
> >>> >>>>> >>
> >>> >>>>>
> >>> >>>>
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> Ryan Blue
> >>> >>> Software Engineer
> >>> >>> Netflix
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > Ryan Blue
> >>> > Software Engineer
> >>> > Netflix
> >>
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Clarifying valid uses for RLE encoding type

Reply via email to