Re: Clarifying valid uses for RLE encoding type

Ryan Blue Thu, 07 Dec 2017 15:54:11 -0800

Good point. For Parquet Java this is always passed in. I guess this is
using the type's maximum width? If so, I don't think this would be readable
by other Parquet implementations because there is no place to store the bit
width.
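
A minimal sketch, not from the original thread, of why the missing bit
width matters. It assumes the RLE-run layout described in Encodings.md (a
varint header holding the run length shifted left by one, followed by the
repeated value stored in the bit width rounded up to whole bytes). The
function names are hypothetical and bit-packed runs are left out for brevity.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 varint, the form used for run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_rle_run(value: int, count: int, bit_width: int) -> bytes:
    """One RLE run: header is (count << 1); the low bit 0 marks an RLE run."""
    value_bytes = (bit_width + 7) // 8  # round the width up to whole bytes
    return encode_varint(count << 1) + value.to_bytes(value_bytes, "little")


def decode_rle_runs(buf: bytes, bit_width: int) -> list:
    """Decode a buffer of RLE runs; the caller must supply bit_width."""
    value_bytes = (bit_width + 7) // 8
    pos, out = 0, []
    while pos < len(buf):
        header, pos = decode_varint(buf, pos)
        if header & 1:
            raise NotImplementedError("bit-packed runs are omitted from this sketch")
        value = int.from_bytes(buf[pos:pos + value_bytes], "little")
        pos += value_bytes
        out.extend([value] * (header >> 1))
    return out


# Three runs written with bit width 8.
page = encode_rle_run(7, 3, 8) + encode_rle_run(2, 1, 8) + encode_rle_run(5, 4, 8)

right = decode_rle_runs(page, 8)   # the width the writer used
wrong = decode_rle_runs(page, 16)  # a reader guessing the width
```

Nothing in the bytes themselves says which width was used: the same buffer
decodes without error under both widths but yields different values, so a
reader has to learn the width from somewhere outside the encoded data.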


On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]>
wrote:

> > Using the RLE encoding will be different from the plain encoding because
> > you'd have the overhead bytes for runs and packed sections. We would still
> > pack int64 values using the width, which is a required parameter.
>
> How would a reader determine the bit width though? I can't see anywhere in
> the format where the bit width is explicitly set. For the RLE level
> decoding it's implied by the max rep/def level.
>
> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote:
>
>> > But if you have an int64 column, do you just store the 64-bit values
>> > back-to-back? Is that different from the plain encoding?
>>
>> Using the RLE encoding will be different from the plain encoding because
>> you'd have the overhead bytes for runs and packed sections. We would still
>> pack int64 values using the width, which is a required parameter.
>>
>> > I would suggest that we make a minor revision to the format document to
>> > indicate that the RLE encoding is only used for boolean values, dictionary
>> > indices (when using dictionary encoding, which is most of the time), and
>> > the repetition and definition levels.
>>
>> Unsigned, small integers are actually a good case for using RLE codecs.
>> If you can guarantee that you won't have the msb set unless the number
>> really is large, then why not allow people to use them?
>>
>> rb
>>
>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <[email protected]>
>> wrote:
>>
>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
>>> reasonable extension. I'm not sure if other readers support that too in
>>> practice at the moment.
>>>
>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]>
>>> wrote:
>>>
>>>> I think the issue is that in the library (dask/fastparquet) where this
>>>> came up, dictionary encoding in general has not been implemented. So
>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
>>>> encode such data, this is being used as an alternative to PLAIN
>>>> encoding. But since UINT_8 is only a logical type that annotates INT32,
>>>> the RLE encoding as it's defined now cannot be used in general to
>>>> encode INT32.
>>>>
>>>> I would suggest that we make a minor revision to the format document to
>>>> indicate that the RLE encoding is only used for boolean values,
>>>> dictionary indices (when using dictionary encoding, which is most of
>>>> the time), and the repetition and definition levels.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]>
>>>> wrote:
>>>> > The current RLE coding has bit-packing baked into it, so I'm wondering
>>>> > what it even means to bit-pack a lot of the types, particularly if you
>>>> > don't have bounds on the range of values.
>>>> >
>>>> > I can see that if you have a logical int8 column stored in an int32, you
>>>> > have bounds on the values, so bit-packing would let you pack things more
>>>> > densely.
>>>> >
>>>> > But if you have an int64 column, do you just store the 64-bit values
>>>> > back-to-back? Is that different from the plain encoding? Or do you
>>>> > select a bit width per page and store that in the page header?
>>>> >
>>>> > We also can't bit-pack types like strings at all.
>>>> >
>>>> > I guess based on that and Ryan's observation about negative numbers, it
>>>> > sounds like getting a quality RLE encoding isn't a trivial extension of
>>>> > the current encoding and needs some thought.
>>>> >
>>>> >
>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
>>>> >
>>>> >> There isn't anything that I know of that would prevent this from
>>>> >> working. I think the Java library would even read the data successfully
>>>> >> because it allows pages (usually dictionary-encoded ones) to be RLE
>>>> >> encoded.
>>>> >>
>>>> >> The main problem with this is that the RLE encoding is unaware of
>>>> >> negative values. Any negative number causes the entire data page to be
>>>> >> stored with plain encoding because the most-significant bit is set. So
>>>> >> there's just no benefit to doing it.
>>>> >>
>>>> >> The fact that we don't have an encoding that takes advantage of smaller
>>>> >> widths is why I proposed a variant of the RLE codec a while back.
>>>> >> Basically, it makes all numbers positive by zig-zag encoding (moving the
>>>> >> sign bit to the lsb) and then allows the RLE encoding to change packing
>>>> >> width with an extra byte. I think this would be a good one to add for
>>>> >> v2, but this is obviously a separate issue.
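
A minimal sketch, not from the original thread, of the zig-zag step described
above: the sign bit moves to the least-significant bit, so values of small
magnitude become small unsigned numbers that pack into a narrow bit width.
The function names are hypothetical, and the proposed RLE variant's extra
width byte is not modeled here.

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    mask = (1 << bits) - 1
    return ((n << 1) ^ (n >> (bits - 1))) & mask


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def packing_width(unsigned_values) -> int:
    """Smallest bit width that can hold every value in the sequence."""
    return max(v.bit_length() for v in unsigned_values)


values = [-3, -1, 0, 2, 5]

# Two's-complement view: any negative value sets the most-significant bit.
raw_width = packing_width(v & 0xFFFFFFFFFFFFFFFF for v in values)

# Zig-zag view: the same values need only a few bits each.
encoded = [zigzag_encode(v) for v in values]
zz_width = packing_width(encoded)
```

With these sample values the two's-complement form needs the full 64 bits
because of the negative entries, while the zig-zag form fits in 4 bits.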
>>>> >>
>>>> >> rb
>>>> >>
>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
>>>> >>
>>>> >> > Sorry, to clarify, in this question:
>>>> >> >
>>>> >> >
>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>> >> > repetition/definition levels) ever intended for use for encoding data
>>>> >> > pages in the Parquet V1 format?
>>>> >> >
>>>> >> > I meant for encoding data pages that do not contain dictionary indices
>>>> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>>>> >> >
>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
>>>> >> > > We had a discussion recently [1] in which a Python implementation of
>>>> >> > > Parquet had used the RLE encoding type for encoding the data pages for
>>>> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>>>> >> > >
>>>> >> > > In the Encodings.md document [3] in the Parquet format, it is not
>>>> >> > > strictly indicated that the RLE encoding is to be used for
>>>> >> > > definition/repetition levels and boolean, though that is all that is
>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>>>> >> > > implementations.
>>>> >> > >
>>>> >> > > So questions:
>>>> >> > >
>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>> >> > > repetition/definition levels) ever intended for use for encoding data
>>>> >> > > pages in the Parquet V1 format?
>>>> >> > >
>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
>>>> >> > > more explicit about the purpose and scope of this encoding?
>>>> >> > >
>>>> >> > > Thanks,
>>>> >> > > Wes
>>>> >> > >
>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>>>> >> > > [2]: https://github.com/dask/fastparquet
>>>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Ryan Blue
>>>> >> Software Engineer
>>>> >> Netflix
>>>> >>
>>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix
