In the case where this arose, the developer had used the UINT_8 ConvertedType to imply a bit width of 8.
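
To make the assumption concrete, here is a rough Python sketch, not code from fastparquet or any other Parquet implementation. The `IMPLIED_BIT_WIDTH` table and the helper names are hypothetical; they only illustrate how a writer could derive a width from the ConvertedType, and why a plain INT32 column gives a reader nothing to go on. The `zigzag` helper shows the sign-bit-to-lsb mapping mentioned further down the thread.

```python
# Hypothetical illustration only; none of these names come from a real library.

# Assumed mapping from an unsigned-integer ConvertedType to the width it implies.
IMPLIED_BIT_WIDTH = {"UINT_8": 8, "UINT_16": 16, "UINT_32": 32, "UINT_64": 64}


def implied_bit_width(converted_type):
    """Return the width a writer might assume, or None if nothing implies one."""
    return IMPLIED_BIT_WIDTH.get(converted_type)


def min_bit_width(values, physical_bits=32):
    """Smallest width that holds every value's two's-complement bit pattern."""
    mask = (1 << physical_bits) - 1
    return max(((v & mask).bit_length() for v in values), default=0)


def zigzag(v, physical_bits=32):
    """Zig-zag encode: move the sign bit to the lsb so small magnitudes stay small."""
    mask = (1 << physical_bits) - 1
    return ((v << 1) ^ (v >> (physical_bits - 1))) & mask


if __name__ == "__main__":
    print(implied_bit_width("UINT_8"))   # width implied by the annotation
    print(implied_bit_width(None))       # plain INT32: no width is implied
    print(min_bit_width([0, 7, 255]))    # unsigned 8-bit data
    print(min_bit_width([3, -1]))        # a single negative value sets the msb
    print(min_bit_width([zigzag(v) for v in [3, -1]]))
```

The point is only that the width lives outside the page data in this scheme: a reader without the same table, or one looking at an INT32 column with no annotation, cannot recover it.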

On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue wrote:

> Good point. For Parquet Java this is always passed in. I guess this is
> using the type's maximum width? If so, I don't think this would be readable
> by other Parquet implementations because there is no place to store the bit
> width.
>
> On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong wrote:
>
>> > Using the RLE encoding will be different from the plain encoding because
>> > you'd have the overhead bytes for runs and packed sections. We would still
>> > pack int64 values using the width, which is a required parameter.
>>
>> How would a reader determine the bit width though? I can't see anywhere in
>> the format where the bit width is explicitly set. For the RLE level
>> decoding it's implied by the max rep/def level.
>>
>> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue wrote:
>>
>>> > But if you have a int64 column, do you just store the 64 bit values
>>> > back-to-back? Is that different from the plain encoding?
>>>
>>> Using the RLE encoding will be different from the plain encoding because
>>> you'd have the overhead bytes for runs and packed sections. We would still
>>> pack int64 values using the width, which is a required parameter.
>>>
>>> > I would suggest that we make a minor revision to the format document to
>>> > indicate that the RLE encoding is only used for boolean values, dictionary
>>> > indices (when using dictionary encoding, which is most of the time), and
>>> > the repetition and definition levels.
>>>
>>> Unsigned, small integers are actually a good case for using RLE codecs.
>>> If you can guarantee that you won't have the msb set unless the number
>>> really is large, then why not allow people to use them?
>>>
>>> rb
>>>
>>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong wrote:
>>>
>>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
>>>> reasonable extension. I'm not sure if other readers support that too in
>>>> practice at the moment.
>>>>
>>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney wrote:
>>>>
>>>>> I think the issue is that in the library (dask/fastparquet) where this
>>>>> came up, dictionary encoding in general has not been implemented. So
>>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
>>>>> encode such data, this is being used as an alternative to PLAIN
>>>>> encoding. But since UINT_8 is only a logical type that annotates INT32,
>>>>> the RLE encoding as it's defined now cannot be used in general to
>>>>> encode INT32.
>>>>>
>>>>> I would suggest that we make a minor revision to the format document to
>>>>> indicate that the RLE encoding is only used for boolean values,
>>>>> dictionary indices (when using dictionary encoding, which is most of
>>>>> the time), and the repetition and definition levels.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong wrote:
>>>>> > The current RLE coding has bit-packing baked into it, so I'm wondering what
>>>>> > it even means to bit-pack a lot of the types, particularly if you don't
>>>>> > have bounds on the range of values.
>>>>> >
>>>>> > I can see if you have a logical int8 column stored in an int32, you have
>>>>> > bounds on the values, so bit-packing would let you pack things more densely
>>>>> >
>>>>> > But if you have a int64 column, do you just store the 64 bit values
>>>>> > back-to-back? Is that different from the plain encoding? Or do you select a
>>>>> > bitwidth per page and store that in the page header?
>>>>> >
>>>>> > We also can't bit-pack types like strings at all.
>>>>> >
>>>>> > I guess based on that and Ryan's observation about negative numbers, it
>>>>> > sounds like getting a quality RLE encoding for isn't a trivial extension of
>>>>> > the current encoding and needs some thought.
>>>>> >
>>>>> >
>>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue wrote:
>>>>> >
>>>>> >> There isn't anything that I know of that would prevent this from working. I
>>>>> >> think the Java library would even read the data successfully because it
>>>>> >> allows pages (usually dictionary-encoded ones) to be RLE encoded.
>>>>> >>
>>>>> >> The main problem with this is that the RLE encoding is unaware of negative
>>>>> >> values. Any negative number causes the entire data page to be stored with
>>>>> >> plain encoding because the most-significant bit is set. So there's just no
>>>>> >> benefit to doing it.
>>>>> >>
>>>>> >> The fact that we don't have an encoding that takes advantage of smaller
>>>>> >> widths is why I proposed a variant of the RLE codec a while back.
>>>>> >> Basically, it makes all numbers positive by zig-zag encoding (moving the
>>>>> >> sign bit to the lsb) and then allows the RLE encoding to change packing
>>>>> >> width with an extra byte. I think this would be a good one to add for v2,
>>>>> >> but this is obviously a separate issue.
>>>>> >>
>>>>> >> rb
>>>>> >>
>>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney wrote:
>>>>> >>
>>>>> >> > Sorry, to clarify, in this question:
>>>>> >> >
>>>>> >> >
>>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>>> >> > repetition/definition levels) ever intended for use for encoding data
>>>>> >> > pages in the Parquet V1 format?
>>>>> >> >
>>>>> >> > I meant for encoding data pages that do not contain dictionary indices
>>>>> >> > (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>>>>> >> >
>>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney wrote:
>>>>> >> > > We had a discussion recently [1] in which a Python implementation of
>>>>> >> > > Parquet had used the RLE encoding type for encoding the data pages for
>>>>> >> > > INT32 values with UINT_8 logical type (non dictionary-encoded).
>>>>> >> > >
>>>>> >> > > In the Encodings.md document [3] in the Parquet format, it is not
>>>>> >> > > strictly indicated that the RLE encoding is to be used for
>>>>> >> > > definition/repetition levels and boolean, though that is all that is
>>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
>>>>> >> > > implementations.
>>>>> >> > >
>>>>> >> > > So questions:
>>>>> >> > >
>>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
>>>>> >> > > repetition/definition levels) ever intended for use for encoding data
>>>>> >> > > pages in the Parquet V1 format?
>>>>> >> > >
>>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format to be
>>>>> >> > > more explicit about the purpose and scope of this encoding?
>>>>> >> > >
>>>>> >> > > Thanks,
>>>>> >> > > Wes
>>>>> >> > >
>>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
>>>>> >> > > [2]: https://github.com/dask/fastparquet
>>>>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
>>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
>>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Ryan Blue
>>>>> >> Software Engineer
>>>>> >> Netflix
>>>>> >>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
