Re: Clarifying valid uses for RLE encoding type

Wes McKinney Fri, 08 Dec 2017 11:28:27 -0800

In the case where this arose, the developer had used the UINT_8
ConvertedType to imply a bit width of 8.


On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]> wrote:
> Good point. For Parquet Java this is always passed in. I guess this is
> using the type's maximum width? If so, I don't think this would be readable
> by other Parquet implementations because there is no place to store the bit
> width.
>
> On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]>
> wrote:
>
>> > Using the RLE encoding will be different from the plain encoding because
>> you'd have the overhead bytes for runs and packed sections. We would still
>> pack int64 values using the width, which is a required parameter.
>> How would a reader determine the bit width though? I can't see anywhere in
>> the format where the bit width is explicitly set. For the RLE level
>> decoding it's implied by the max rep/def level.
>>
>> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote:
>>
>>> > But if you have a int64 column, do you just store the 64 bit values
>>> back-to-back? Is that different from the plain encoding?
>>>
>>> Using the RLE encoding will be different from the plan encoding because
>>> you'd have the overhead bytes for runs and packed sections. We would still
>>> pack int64 values using the width, which is a required parameter.
>>>
>>> > I would suggest that we make a minor revision the format document to
>>> indicate that the RLE encoding is only used for boolean values, dictionary
>>> indices (when using dictionary encoding, which is most of the time), and
>>> the repetition and definition levels.
>>>
>>> Unsigned, small integers are actually a good case for using RLE codecs.
>>> If you can guarantee that you won't have the msb set unless the number
>>> really is large, then why not allow people to use them?
>>>
>>> rb
>>>
>>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <[email protected]>
>>> wrote:
>>>
>>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
>>>> reasonable extension. I'm not sure if other readers support that too in
>>>> practice at the moment.
>>>>
>>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]>
>>>> wrote:
>>>>
>>>>> I think the issue is that in the library (dask/fastparquet) where this
>>>>> came up, dictionary encoding in general has not been implemented. So
>>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
>>>>> encode such data, this is being used as an alternative to PLAIN
>>>>> encoding. But since UINT_8 is only a logical type the annotates INT32,
>>>>> the RLE encoding as it's defined now cannot be used in general to
>>>>> encode INT32.
>>>>>
>>>>> I would suggest that we make a minor revision the format document to
>>>>> indicate that the RLE encoding is only used for boolean values,
>>>>> dictionary indices (when using dictionary encoding, which is most of
>>>>> the time), and the repetition and definition levels.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]>
>>>>> wrote:
>>>>> > The current RLE coding has bit-packing baked into it, so I'm
>>>>> wondering what
>>>>> > it even means to bit-pack a lot of the types, particularly if you
>>>>> don't
>>>>> > have bounds on the range of values.
>>>>> >
>>>>> > I can see if you have a logic int8 column stored in an int32, you have
>>>>> > bounds on the values, so bit-packing would let you pack things more
>>>>> densely
>>>>> >
>>>>> > But if you have a int64 column, do you just store the 64 bit values
>>>>> > back-to-back? Is that different from the plain encoding? Or do you
>>>>> select a
>>>>> > bitwidth per page and store that in the page header?
>>>>> >
>>>>> > We also can't bit-pack types like strings at all.
>>>>> >
>>>>> > I guess based on that and Ryan's observation about negative numbers,
>>>>> it
>>>>> > sounds like getting a quality RLE encoding for isn't a trivial
>>>>> extension of
>>>>> > the current encoding and needs some thought.
>>>>> >
>>>>> >
>>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> >> There isn't anything that I know of that would prevent this from
>>>>> working. I
>>>>> >> think the Java library would even read the data successfully because
>>>>> it
>>>>> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>>>>> >>
>>>>> >> The main problem with this is that the RLE encoding is unaware of
>>>>> negative
>>>>> >> values. Any negative number causes the entire data page to be stored
>>>>> with
>>>>> >> plain encoding because the most-significant bit is set. So there's
>>>>> just no
>>>>> >> benefit to doing it.
>>>>> >>
>>>>> >> The fact that we don't have an encoding that takes advantage of
>>>>> smaller
>>>>> >> widths is why I proposed a variant of the RLE codec a while back.
>>>>> >> Basically, it makes all numbers positive by zig-zag encoding (moving
>>>>> the
>>>>> >> sign bit to the lsb) and then allows the RLE encoding to change
>>>>> packing
>>>>> >> width with an extra byte. I think this would be a good one to add
>>>>> for v2,
>>>>> >> but this is obviously a separate issue.
>>>>> >>
>>>>> >> rb
>>>>> >>
>>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> > Sorry, to clarify, in this question:
>>>>> >> >
>>>>> >> >
>>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>>> >> > repetition/definition levels) ever intended for use for encoding
>>>>> data
>>>>> >> > pages in the Parquet V1 format?
>>>>> >> >
>>>>> >> > I meant for encoding data pages that do not contain dictionary
>>>>> indices
>>>>> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONAR
>>>>> Y)
>>>>> >> >
>>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]>
>>>>> >> wrote:
>>>>> >> > > We had a discussion recently [1] in which a Python
>>>>> implementation of
>>>>> >> > > Parquet had used the RLE encoding type for encoding the data
>>>>> pages for
>>>>> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>>>>> >> > >
>>>>> >> > > In the Encodings.md document [3] in the Parquet format, it is not
>>>>> >> > > strictly indicated that the RLE encoding is to be used for
>>>>> >> > > definition/repetition levels and boolean, though that is all
>>>>> that is
>>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>>>>> >> > > implementations.
>>>>> >> > >
>>>>> >> > > So questions:
>>>>> >> > >
>>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>>> >> > > repetition/definition levels) ever intended for use for encoding
>>>>> data
>>>>> >> > > pages in the Parquet V1 format?
>>>>> >> > >
>>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format to
>>>>> be
>>>>> >> > > more explicit about the purpose and scope of this encoding?
>>>>> >> > >
>>>>> >> > > Thanks,
>>>>> >> > > Wes
>>>>> >> > >
>>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>>>>> >> > > [2]: https://github.com/dask/fastparquet
>>>>> >> > > [3]: https://github.com/apache/parq
>>>>> uet-format/blob/master/Encodings.md
>>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/
>>>>> >> > parquet-column/src/main/java/org/apache/parquet/column/
>>>>> >> Encoding.java#L115
>>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/
>>>>> >> > exec/parquet-column-readers.cc#L495
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Ryan Blue
>>>>> >> Software Engineer
>>>>> >> Netflix
>>>>> >>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: Clarifying valid uses for RLE encoding type

Reply via email to