Re: Clarifying valid uses for RLE encoding type

Wes McKinney Wed, 13 Dec 2017 10:39:42 -0800

Alright, so what I've taken from this thread is that we should make a
small amendment to the format documents to indicate valid uses for the
RLE encoding -- at least as far as Parquet format V1 is concerned.
These consist of:


* Repetition and definition levels
* Indices in dictionary pages
* Boolean values

This would help resolve the confusion that we experienced in
https://github.com/dask/fastparquet/issues/256. Does that sound right?

Thanks
Wes
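
For reference, here is a minimal Python sketch (not taken from any Parquet implementation) of why a reader has to know the bit width before it can decode an RLE/bit-packed hybrid stream as described in Encodings.md: the runs themselves do not store it. The names `level_bit_width`, `read_uleb128` and `decode_hybrid` are illustrative assumptions, and error handling is omitted. In the three cases listed above the width comes from outside the runs: it is derived from the maximum repetition/definition level, read from the one-byte width that precedes dictionary indices in a data page, or fixed at 1 for booleans.

```python
def level_bit_width(max_level: int) -> int:
    """Bits needed to represent any level in [0, max_level]."""
    return max_level.bit_length()


def read_uleb128(buf: bytes, pos: int):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_hybrid(buf: bytes, bit_width: int, num_values: int) -> list:
    """Decode RLE/bit-packed hybrid runs; bit_width must be supplied by the caller."""
    values = []
    pos = 0
    value_bytes = (bit_width + 7) // 8  # bytes used by the repeated value of an RLE run
    mask = (1 << bit_width) - 1
    while len(values) < num_values and pos < len(buf):
        header, pos = read_uleb128(buf, pos)
        if header & 1 == 0:
            # RLE run: header >> 1 repetitions of a single value
            run_length = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            values.extend([value] * run_length)
        else:
            # Bit-packed run: header >> 1 groups of 8 values, packed LSB first
            num_groups = header >> 1
            nbytes = num_groups * bit_width
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(num_groups * 8):
                values.append((packed >> (i * bit_width)) & mask)
    return values[:num_values]


# One bit-packed group holding 0..7 at width 3.
packed_run = bytes([0x03, 0x88, 0xC6, 0xFA])
print(decode_hybrid(packed_run, bit_width=3, num_values=8))
# The same bytes read with a different width yield different values.
print(decode_hybrid(packed_run, bit_width=1, num_values=8))
```

Nothing in the byte stream tells the reader which of the two interpretations is right, which is the ambiguity behind the fastparquet issue.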

On Mon, Dec 11, 2017 at 12:52 PM, Ryan Blue <[email protected]> wrote:
>> to do this properly, the writer should be able to specify the bitwidth
>> explicitly per-page
>
> In the encoders I proposed to do this, I added a byte to bit-packed runs
> that encodes the width:
> https://github.com/rdblue/parquet-mr/blob/encoders/parquet-column/src/main/java/org/apache/parquet/column/values/zigzag/VariableWidthRLEEncoder.java#L197
>
> Adding a byte also allows the width to change within a page. It isn't as
> effective as patching for large values, but there are a couple benefits.
> First, you don't use the largest width for all values if you have just one
> large one. Second, you don't have to keep an entire page worth of values in
> memory before encoding because you can widen the bit packed runs and write
> incrementally.
>
> rb
>
> On Fri, Dec 8, 2017 at 11:40 AM, Tim Armstrong <[email protected]>
> wrote:
>>
>> Ah I see - that's definitely not part of the format then, since it
>> requires the reader and writer to agree on the algorithm for deciding
>> bitwidth, and there's no mention of that in the format. It seems like to do
>> this properly, the writer should be able to specify the bitwidth explicitly
>> per-page to also handle cases where the actual values encoded do not need
>> the full bitwidth implied by the type.
>>
>> On Fri, Dec 8, 2017 at 11:26 AM, Wes McKinney <[email protected]> wrote:
>>>
>>> In the case where this arose, the developer had used the UINT_8
>>> ConvertedType to imply a bit width of 8.
>>>
>>> On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]>
>>> wrote:
>>> > Good point. For Parquet Java this is always passed in. I guess this is
>>> > using the type's maximum width? If so, I don't think this would be
>>> > readable by other Parquet implementations because there is no place to
>>> > store the bit width.
>>> >
>>> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]>
>>> > wrote:
>>> >
>>> >> > Using the RLE encoding will be different from the plain encoding
>>> >> > because you'd have the overhead bytes for runs and packed sections.
>>> >> > We would still pack int64 values using the width, which is a
>>> >> > required parameter.
>>> >>
>>> >> How would a reader determine the bit width though? I can't see
>>> >> anywhere in the format where the bit width is explicitly set. For the
>>> >> RLE level decoding it's implied by the max rep/def level.
>>> >>
>>> >> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote:
>>> >>
>>> >>> > But if you have an int64 column, do you just store the 64 bit values
>>> >>> > back-to-back? Is that different from the plain encoding?
>>> >>>
>>> >>> Using the RLE encoding will be different from the plain encoding
>>> >>> because you'd have the overhead bytes for runs and packed sections.
>>> >>> We would still pack int64 values using the width, which is a required
>>> >>> parameter.
>>> >>>
>>> >>> > I would suggest that we make a minor revision to the format document
>>> >>> > to indicate that the RLE encoding is only used for boolean values,
>>> >>> > dictionary indices (when using dictionary encoding, which is most of
>>> >>> > the time), and the repetition and definition levels.
>>> >>>
>>> >>> Unsigned, small integers are actually a good case for using RLE
>>> >>> codecs. If you can guarantee that you won't have the msb set unless
>>> >>> the number really is large, then why not allow people to use them?
>>> >>>
>>> >>> rb
>>> >>>
>>> >>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong
>>> >>> <[email protected]>
>>> >>> wrote:
>>> >>>
>>> >>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
>>> >>>> reasonable extension. I'm not sure if other readers support that too
>>> >>>> in
>>> >>>> practice at the moment.
>>> >>>>
>>> >>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> I think the issue is that in the library (dask/fastparquet) where
>>> >>>>> this came up, dictionary encoding in general has not been
>>> >>>>> implemented. So for unsigned 8-bit integers, since you can use RLE
>>> >>>>> with bit width 8 to encode such data, this is being used as an
>>> >>>>> alternative to PLAIN encoding. But since UINT_8 is only a logical
>>> >>>>> type that annotates INT32, the RLE encoding as it's defined now
>>> >>>>> cannot be used in general to encode INT32.
>>> >>>>>
>>> >>>>> I would suggest that we make a minor revision to the format document
>>> >>>>> to indicate that the RLE encoding is only used for boolean values,
>>> >>>>> dictionary indices (when using dictionary encoding, which is most of
>>> >>>>> the time), and the repetition and definition levels.
>>> >>>>>
>>> >>>>> - Wes
>>> >>>>>
>>> >>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong
>>> >>>>> <[email protected]>
>>> >>>>> wrote:
>>> >>>>> > The current RLE coding has bit-packing baked into it, so I'm
>>> >>>>> > wondering what it even means to bit-pack a lot of the types,
>>> >>>>> > particularly if you don't have bounds on the range of values.
>>> >>>>> >
>>> >>>>> > I can see if you have a logical int8 column stored in an int32,
>>> >>>>> > you have bounds on the values, so bit-packing would let you pack
>>> >>>>> > things more densely.
>>> >>>>> >
>>> >>>>> > But if you have an int64 column, do you just store the 64 bit
>>> >>>>> > values back-to-back? Is that different from the plain encoding? Or
>>> >>>>> > do you select a bitwidth per page and store that in the page
>>> >>>>> > header?
>>> >>>>> >
>>> >>>>> > We also can't bit-pack types like strings at all.
>>> >>>>> >
>>> >>>>> > I guess based on that and Ryan's observation about negative
>>> >>>>> > numbers, it sounds like getting a quality RLE encoding isn't a
>>> >>>>> > trivial extension of the current encoding and needs some thought.
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue
>>> >>>>> > <[email protected]>
>>> >>>>> wrote:
>>> >>>>> >
>>> >>>>> >> There isn't anything that I know of that would prevent this from
>>> >>>>> >> working. I think the Java library would even read the data
>>> >>>>> >> successfully because it allows pages (usually dictionary-encoded
>>> >>>>> >> ones) to be RLE encoded.
>>> >>>>> >>
>>> >>>>> >> The main problem with this is that the RLE encoding is unaware of
>>> >>>>> >> negative values. Any negative number causes the entire data page
>>> >>>>> >> to be stored with plain encoding because the most-significant bit
>>> >>>>> >> is set. So there's just no benefit to doing it.
>>> >>>>> >>
>>> >>>>> >> The fact that we don't have an encoding that takes advantage of
>>> >>>>> >> smaller widths is why I proposed a variant of the RLE codec a
>>> >>>>> >> while back. Basically, it makes all numbers positive by zig-zag
>>> >>>>> >> encoding (moving the sign bit to the lsb) and then allows the RLE
>>> >>>>> >> encoding to change packing width with an extra byte. I think this
>>> >>>>> >> would be a good one to add for v2, but this is obviously a
>>> >>>>> >> separate issue.
>>> >>>>> >>
>>> >>>>> >> rb
>>> >>>>> >>
>>> >>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney
>>> >>>>> >> <[email protected]>
>>> >>>>> wrote:
>>> >>>>> >>
>>> >>>>> >> > Sorry, to clarify, in this question:
>>> >>>>> >> >
>>> >>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>> >>>>> >> > repetition/definition levels) ever intended for use for
>>> >>>>> >> > encoding data pages in the Parquet V1 format?
>>> >>>>> >> >
>>> >>>>> >> > I meant for encoding data pages that do not contain dictionary
>>> >>>>> >> > indices (i.e. as an alternative to PLAIN or
>>> >>>>> >> > PLAIN_DICTIONARY/RLE_DICTIONARY)
>>> >>>>> >> >
>>> >>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney
>>> >>>>> >> > <[email protected]>
>>> >>>>> >> wrote:
>>> >>>>> >> > > We had a discussion recently [1] in which a Python
>>> >>>>> >> > > implementation of Parquet had used the RLE encoding type for
>>> >>>>> >> > > encoding the data pages for INT32 values with UINT_8 logical
>>> >>>>> >> > > type (non dictionary-encoded).
>>> >>>>> >> > >
>>> >>>>> >> > > In the Encodings.md document [3] in the Parquet format, it
>>> >>>>> >> > > is not strictly indicated that the RLE encoding is to be used
>>> >>>>> >> > > only for definition/repetition levels and boolean, though that
>>> >>>>> >> > > is all that is supported in parquet-mr [4], parquet-cpp,
>>> >>>>> >> > > Impala [5], and other implementations.
>>> >>>>> >> > >
>>> >>>>> >> > > So questions:
>>> >>>>> >> > >
>>> >>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>> >>>>> >> > > repetition/definition levels) ever intended for use for
>>> >>>>> >> > > encoding data pages in the Parquet V1 format?
>>> >>>>> >> > >
>>> >>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format
>>> >>>>> >> > > to be more explicit about the purpose and scope of this
>>> >>>>> >> > > encoding?
>>> >>>>> >> > >
>>> >>>>> >> > > Thanks,
>>> >>>>> >> > > Wes
>>> >>>>> >> > >
>>> >>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>>> >>>>> >> > > [2]: https://github.com/dask/fastparquet
>>> >>>>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>>> >>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>>> >>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>>> >>>>> >> >
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> --
>>> >>>>> >> Ryan Blue
>>> >>>>> >> Software Engineer
>>> >>>>> >> Netflix
>>> >>>>> >>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Ryan Blue
>>> >>> Software Engineer
>>> >>> Netflix
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
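
The zig-zag step that Ryan describes in the quoted thread (moving the sign bit to the lsb so that small negative numbers become small non-negative ones) can be sketched as follows. This is a generic illustration in Python under the assumption of 32-bit signed inputs, not the code from the linked encoder branch; the helper names `zigzag_encode`, `zigzag_decode`, `raw_width` and `zigzag_width` are hypothetical.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def raw_width(values, bits: int = 32) -> int:
    """Packing width needed for the two's-complement bit patterns."""
    return max((v & ((1 << bits) - 1)).bit_length() for v in values)


def zigzag_width(values, bits: int = 32) -> int:
    """Packing width needed after zig-zag encoding."""
    return max(zigzag_encode(v, bits).bit_length() for v in values)


values = [0, -1, 1, -2, 2]
print([zigzag_encode(v) for v in values])
print(raw_width(values), zigzag_width(values))
```

A single -1 forces the raw bit patterns to the full 32-bit width because the most-significant bit is set, whereas the zig-zag values for the same list fit in 3 bits.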
