Good point. For Parquet Java, the bit width is always passed in. I guess this is using the type's maximum width? If so, I don't think this would be readable by other Parquet implementations, because there is no place in the format to store the bit width.
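To illustrate why a missing bit width is a problem for readers, here is a minimal sketch of plain bit-packing. It assumes the LSB-first layout used by the bit-packed runs in Encodings.md and leaves out the run headers of the RLE/bit-packing hybrid entirely. The helper names `pack_bits` and `unpack_bits` are hypothetical and are not part of parquet-mr or fastparquet.

```python
def pack_bits(values, width):
    """Pack non-negative ints LSB-first, `width` bits each, into bytes."""
    acc = 0
    nbits = 0
    for v in values:
        if v < 0 or v >> width:
            raise ValueError(f"{v} does not fit in {width} bits")
        acc |= v << nbits
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")


def unpack_bits(data, width, count):
    """Inverse of pack_bits; the caller has to supply the same width."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


values = [0, 1, 2, 3, 4, 5, 6, 7]
packed = pack_bits(values, 3)            # 8 values * 3 bits = 3 bytes

same_width = unpack_bits(packed, 3, 8)   # writer's width: round-trips
wrong_width = unpack_bits(packed, 8, 3)  # reader guessing 8 bits per value

print(packed.hex())   # 88c6fa
print(same_width)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(wrong_width)    # [136, 198, 250]
```

The same three bytes decode to different values depending on the width the reader assumes, so the width has to come from somewhere outside the packed data. For levels it comes from the max rep/def level; for arbitrary data pages there is no equivalent.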

On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]> wrote:

> > Using the RLE encoding will be different from the plain encoding because
> > you'd have the overhead bytes for runs and packed sections. We would still
> > pack int64 values using the width, which is a required parameter.
>
> How would a reader determine the bit width though? I can't see anywhere in
> the format where the bit width is explicitly set. For the RLE level
> decoding it's implied by the max rep/def level.
>
> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote:
>
>> > But if you have a int64 column, do you just store the 64 bit values
>> > back-to-back? Is that different from the plain encoding?
>>
>> Using the RLE encoding will be different from the plain encoding because
>> you'd have the overhead bytes for runs and packed sections. We would still
>> pack int64 values using the width, which is a required parameter.
>>
>> > I would suggest that we make a minor revision to the format document to
>> > indicate that the RLE encoding is only used for boolean values, dictionary
>> > indices (when using dictionary encoding, which is most of the time), and
>> > the repetition and definition levels.
>>
>> Unsigned, small integers are actually a good case for using RLE codecs.
>> If you can guarantee that you won't have the msb set unless the number
>> really is large, then why not allow people to use them?
>>
>> rb
>>
>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <[email protected]>
>> wrote:
>>
>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
>>> reasonable extension. I'm not sure if other readers support that too in
>>> practice at the moment.
>>>
>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]>
>>> wrote:
>>>
>>>> I think the issue is that in the library (dask/fastparquet) where this
>>>> came up, dictionary encoding in general has not been implemented. So
>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
>>>> encode such data, this is being used as an alternative to PLAIN
>>>> encoding. But since UINT_8 is only a logical type that annotates INT32,
>>>> the RLE encoding as it's defined now cannot be used in general to
>>>> encode INT32.
>>>>
>>>> I would suggest that we make a minor revision to the format document to
>>>> indicate that the RLE encoding is only used for boolean values,
>>>> dictionary indices (when using dictionary encoding, which is most of
>>>> the time), and the repetition and definition levels.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]>
>>>> wrote:
>>>> > The current RLE coding has bit-packing baked into it, so I'm wondering what
>>>> > it even means to bit-pack a lot of the types, particularly if you don't
>>>> > have bounds on the range of values.
>>>> >
>>>> > I can see if you have a logical int8 column stored in an int32, you have
>>>> > bounds on the values, so bit-packing would let you pack things more densely
>>>> >
>>>> > But if you have a int64 column, do you just store the 64 bit values
>>>> > back-to-back? Is that different from the plain encoding? Or do you select a
>>>> > bitwidth per page and store that in the page header?
>>>> >
>>>> > We also can't bit-pack types like strings at all.
>>>> >
>>>> > I guess based on that and Ryan's observation about negative numbers, it
>>>> > sounds like getting a quality RLE encoding for isn't a trivial extension of
>>>> > the current encoding and needs some thought.
>>>> >
>>>> >
>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
>>>> >
>>>> >> There isn't anything that I know of that would prevent this from working. I
>>>> >> think the Java library would even read the data successfully because it
>>>> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>>>> >>
>>>> >> The main problem with this is that the RLE encoding is unaware of negative
>>>> >> values. Any negative number causes the entire data page to be stored with
>>>> >> plain encoding because the most-significant bit is set. So there's just no
>>>> >> benefit to doing it.
>>>> >>
>>>> >> The fact that we don't have an encoding that takes advantage of smaller
>>>> >> widths is why I proposed a variant of the RLE codec a while back.
>>>> >> Basically, it makes all numbers positive by zig-zag encoding (moving the
>>>> >> sign bit to the lsb) and then allows the RLE encoding to change packing
>>>> >> width with an extra byte. I think this would be a good one to add for v2,
>>>> >> but this is obviously a separate issue.
>>>> >>
>>>> >> rb
>>>> >>
>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
>>>> >>
>>>> >> > Sorry, to clarify, in this question:
>>>> >> >
>>>> >> >
>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>> >> > repetition/definition levels) ever intended for use for encoding data
>>>> >> > pages in the Parquet V1 format?
>>>> >> >
>>>> >> > I meant for encoding data pages that do not contain dictionary indices
>>>> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>>>> >> >
>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]>
>>>> >> > wrote:
>>>> >> > > We had a discussion recently [1] in which a Python implementation of
>>>> >> > > Parquet had used the RLE encoding type for encoding the data pages for
>>>> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>>>> >> > >
>>>> >> > > In the Encodings.md document [3] in the Parquet format, it is not
>>>> >> > > strictly indicated that the RLE encoding is to be used for
>>>> >> > > definition/repetition levels and boolean, though that is all that is
>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>>>> >> > > implementations.
>>>> >> > >
>>>> >> > > So questions:
>>>> >> > >
>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>> >> > > repetition/definition levels) ever intended for use for encoding data
>>>> >> > > pages in the Parquet V1 format?
>>>> >> > >
>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
>>>> >> > > more explicit about the purpose and scope of this encoding?
>>>> >> > >
>>>> >> > > Thanks,
>>>> >> > > Wes
>>>> >> > >
>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>>>> >> > > [2]: https://github.com/dask/fastparquet
>>>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>>>> >>
>>>> >> --
>>>> >> Ryan Blue
>>>> >> Software Engineer
>>>> >> Netflix
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix

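The zig-zag step mentioned in the quoted thread ("moving the sign bit to the lsb") can be sketched as below. This assumes the common protobuf-style zig-zag mapping; the thread does not spell out the exact mapping of the proposed codec, and the extra width byte is not modeled here. The function names, including `packing_width`, are hypothetical.

```python
def zigzag_encode(n, bits=32):
    """Map a signed `bits`-wide int to a non-negative int (sign ends up in the LSB)."""
    if not -(1 << (bits - 1)) <= n < (1 << (bits - 1)):
        raise ValueError(f"{n} is out of range for a {bits}-bit signed integer")
    # n >> (bits - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return (n << 1) ^ (n >> (bits - 1))


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def packing_width(values, bits=32):
    """Smallest bit width that holds every zig-zag encoded value."""
    return max((zigzag_encode(v, bits) for v in values), default=0).bit_length()


values = [-3, -1, 0, 2, 3]
encoded = [zigzag_encode(v) for v in values]

print(encoded)                              # [5, 1, 0, 4, 6]
print([zigzag_decode(z) for z in encoded])  # [-3, -1, 0, 2, 3]
print(packing_width(values))                # 3
```

Without the mapping, a single -1 in an INT32 column has all 32 bits set and forces the full width. With it, small negative and small positive values both stay small, so a narrow packing width remains usable.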