> But if you have an int64 column, do you just store the 64 bit values
back-to-back? Is that different from the plain encoding?

Using the RLE encoding will be different from the plain encoding because
you'd have the overhead bytes for runs and packed sections. We would still
pack int64 values using the width, which is a required parameter.
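
As a rough illustration of that overhead, here is a minimal sketch of one
bit-packed run at width 64 next to the plain layout. It assumes the hybrid
layout as described in Encodings.md (a varint header, then values packed
LSB-first) and non-negative values; the function names are made up for the
example, and the length prefix that precedes RLE data in a page is left out.

```python
import struct

def varint(n):
    """Unsigned LEB128 varint, as used for run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def bit_packed_run(values, bit_width):
    """One bit-packed run: header is (groups of 8 values << 1) | 1."""
    assert len(values) % 8 == 0, "bit-packed runs hold multiples of 8 values"
    assert all(0 <= v < (1 << bit_width) for v in values)
    header = varint(((len(values) // 8) << 1) | 1)
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)  # pack LSB-first
    return header + acc.to_bytes(len(values) * bit_width // 8, "little")

def plain_int64(values):
    """PLAIN encoding of INT64: little-endian 8-byte values back-to-back."""
    return struct.pack("<%dq" % len(values), *values)

values = [1, 2, 3, 4, 5, 6, 7, 2**40]
packed = bit_packed_run(values, 64)
plain = plain_int64(values)
print(len(packed), len(plain))
```

At width 64 the packed bytes after the header match the plain bytes, so the
only difference in this sketch is the header byte per run.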

> I would suggest that we make a minor revision to the format document to
indicate that the RLE encoding is only used for boolean values, dictionary
indices (when using dictionary encoding, which is most of the time), and
the repetition and definition levels.

Unsigned, small integers are actually a good case for using RLE codecs. If
you can guarantee that you won't have the msb set unless the number really
is large, then why not allow people to use them?
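
To make the small-integer case concrete, here is a minimal sketch comparing
one RLE run at bit width 8 with plain INT32 for the same values. It assumes
the run layout from Encodings.md (a varint header holding the run length
shifted left by one, then the repeated value in ceil(width / 8) bytes); the
function names are hypothetical and the page-level length prefix is omitted.

```python
import struct

def varint(n):
    """Unsigned LEB128 varint, as used for run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def rle_run(value, count, bit_width):
    """One RLE run: header is (count << 1), value padded to whole bytes."""
    assert 0 <= value < (1 << bit_width), "value must fit in bit_width"
    value_bytes = (bit_width + 7) // 8
    return varint(count << 1) + value.to_bytes(value_bytes, "little")

def plain_int32(values):
    """PLAIN encoding of INT32: little-endian 4-byte values back-to-back."""
    return struct.pack("<%di" % len(values), *values)

column = [7] * 1000  # a UINT_8-annotated INT32 column with one long run
rle_bytes = rle_run(7, len(column), 8)
plain_bytes = plain_int32(column)
print(len(rle_bytes), len(plain_bytes))
```

For this input the run takes 3 bytes against 4000 bytes of plain INT32. A
value with the msb set would not fit in width 8, which is the guarantee
mentioned above.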

rb

On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <[email protected]>
wrote:

> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
> reasonable extension. I'm not sure if other readers support that too in
> practice at the moment.
>
> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]> wrote:
>
>> I think the issue is that in the library (dask/fastparquet) where this
>> came up, dictionary encoding in general has not been implemented. So
>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
>> encode such data, this is being used as an alternative to PLAIN
>> encoding. But since UINT_8 is only a logical type that annotates INT32,
>> the RLE encoding as it's defined now cannot be used in general to
>> encode INT32.
>>
>> I would suggest that we make a minor revision to the format document to
>> indicate that the RLE encoding is only used for boolean values,
>> dictionary indices (when using dictionary encoding, which is most of
>> the time), and the repetition and definition levels.
>>
>> - Wes
>>
>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]>
>> wrote:
>> > The current RLE coding has bit-packing baked into it, so I'm wondering
>> what
>> > it even means to bit-pack a lot of the types, particularly if you don't
>> > have bounds on the range of values.
>> >
>> > I can see if you have a logical int8 column stored in an int32, you have
>> > bounds on the values, so bit-packing would let you pack things more
>> densely
>> >
>> > But if you have an int64 column, do you just store the 64 bit values
>> > back-to-back? Is that different from the plain encoding? Or do you
>> select a
>> > bitwidth per page and store that in the page header?
>> >
>> > We also can't bit-pack types like strings at all.
>> >
>> > I guess based on that and Ryan's observation about negative numbers, it
>> > sounds like getting a quality RLE encoding for isn't a trivial
>> extension of
>> > the current encoding and needs some thought.
>> >
>> >
>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]>
>> wrote:
>> >
>> >> There isn't anything that I know of that would prevent this from
>> working. I
>> >> think the Java library would even read the data successfully because it
>> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>> >>
>> >> The main problem with this is that the RLE encoding is unaware of
>> negative
>> >> values. Any negative number causes the entire data page to be stored
>> with
>> >> plain encoding because the most-significant bit is set. So there's
>> just no
>> >> benefit to doing it.
>> >>
>> >> The fact that we don't have an encoding that takes advantage of smaller
>> >> widths is why I proposed a variant of the RLE codec a while back.
>> >> Basically, it makes all numbers positive by zig-zag encoding (moving
>> the
>> >> sign bit to the lsb) and then allows the RLE encoding to change packing
>> >> width with an extra byte. I think this would be a good one to add for
>> v2,
>> >> but this is obviously a separate issue.
>> >>
>> >> rb
>> >>
>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]>
>> wrote:
>> >>
>> >> > Sorry, to clarify, in this question:
>> >> >
>> >> >
>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>> >> > repetition/definition levels) ever intended for use for encoding data
>> >> > pages in the Parquet V1 format?
>> >> >
>> >> > I meant for encoding data pages that do not contain dictionary
>> indices
>> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>> >> >
>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]>
>> >> wrote:
>> >> > > We had a discussion recently [1] in which a Python implementation
>> of
>> >> > > Parquet had used the RLE encoding type for encoding the data pages
>> for
>> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>> >> > >
>> >> > > In the Encodings.md document [3] in the Parquet format, it is not
>> >> > > strictly indicated that the RLE encoding is to be used for
>> >> > > definition/repetition levels and boolean, though that is all that
>> is
>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>> >> > > implementations.
>> >> > >
>> >> > > So questions:
>> >> > >
>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>> >> > > repetition/definition levels) ever intended for use for encoding
>> data
>> >> > > pages in the Parquet V1 format?
>> >> > >
>> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
>> >> > > more explicit about the purpose and scope of this encoding?
>> >> > >
>> >> > > Thanks,
>> >> > > Wes
>> >> > >
>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>> >> > > [2]: https://github.com/dask/fastparquet
>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>> >>
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix
