Ah, I see. That's definitely not part of the format then, since it requires the reader and writer to agree on the algorithm for deciding the bit width, and the format makes no mention of one. To do this properly, the writer should be able to specify the bit width explicitly per page, which would also handle cases where the encoded values don't need the full bit width implied by the type.
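
To make the per-page idea concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the layout (one byte holding the bit width, followed by the values packed LSB-first) and the names `min_bit_width`, `pack_page` and `unpack_page` are hypothetical, not something the format document defines for data pages, and it leaves out the run-length part of the hybrid encoding entirely.

```python
def min_bit_width(values):
    """Smallest bit width that can hold every value in the page."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative values")
    return max(values, default=0).bit_length()


def pack_page(values):
    """Hypothetical layout: 1 byte of bit width, then LSB-first packed values."""
    width = min_bit_width(values)
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)          # place value i at bit offset i * width
    nbytes = (len(values) * width + 7) // 8
    return bytes([width]) + acc.to_bytes(nbytes, "little")


def unpack_page(buf, count):
    """Read the width from the first byte, so no out-of-band agreement is needed."""
    width = buf[0]
    acc = int.from_bytes(buf[1:], "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


page = [3, 1, 0, 7, 2]                   # fits in 3 bits, whatever the declared type
buf = pack_page(page)
assert unpack_page(buf, len(page)) == page
```

The point is just that the reader recovers the width from the page itself, so it can be narrower than the width implied by the type whenever the values allow it.
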
On Fri, Dec 8, 2017 at 11:26 AM, Wes McKinney <[email protected]> wrote:

> In the case where this arose, the developer had used the UINT_8
> ConvertedType to imply a bit width of 8.
>
> On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]> wrote:
>
> > Good point. For Parquet Java this is always passed in. I guess this is
> > using the type's maximum width? If so, I don't think this would be
> > readable by other Parquet implementations because there is no place to
> > store the bit width.
> >
> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]> wrote:
> >
> >> > Using the RLE encoding will be different from the plain encoding
> >> > because you'd have the overhead bytes for runs and packed sections.
> >> > We would still pack int64 values using the width, which is a
> >> > required parameter.
> >>
> >> How would a reader determine the bit width though? I can't see anywhere
> >> in the format where the bit width is explicitly set. For the RLE level
> >> decoding it's implied by the max rep/def level.
> >>
> >> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote:
> >>
> >>> > But if you have an int64 column, do you just store the 64 bit values
> >>> > back-to-back? Is that different from the plain encoding?
> >>>
> >>> Using the RLE encoding will be different from the plain encoding because
> >>> you'd have the overhead bytes for runs and packed sections. We would
> >>> still pack int64 values using the width, which is a required parameter.
> >>>
> >>> > I would suggest that we make a minor revision to the format document
> >>> > to indicate that the RLE encoding is only used for boolean values,
> >>> > dictionary indices (when using dictionary encoding, which is most of
> >>> > the time), and the repetition and definition levels.
> >>>
> >>> Unsigned, small integers are actually a good case for using RLE codecs.
> >>> If you can guarantee that you won't have the msb set unless the number
> >>> really is large, then why not allow people to use them?
> >>>
> >>> rb
> >>>
> >>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <[email protected]> wrote:
> >>>
> >>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a
> >>>> reasonable extension. I'm not sure if other readers support that too
> >>>> in practice at the moment.
> >>>>
> >>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]> wrote:
> >>>>
> >>>>> I think the issue is that in the library (dask/fastparquet) where this
> >>>>> came up, dictionary encoding in general has not been implemented. So
> >>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 to
> >>>>> encode such data, this is being used as an alternative to PLAIN
> >>>>> encoding. But since UINT_8 is only a logical type that annotates
> >>>>> INT32, the RLE encoding as it's defined now cannot be used in general
> >>>>> to encode INT32.
> >>>>>
> >>>>> I would suggest that we make a minor revision to the format document
> >>>>> to indicate that the RLE encoding is only used for boolean values,
> >>>>> dictionary indices (when using dictionary encoding, which is most of
> >>>>> the time), and the repetition and definition levels.
> >>>>>
> >>>>> - Wes
> >>>>>
> >>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <[email protected]> wrote:
> >>>>> > The current RLE coding has bit-packing baked into it, so I'm
> >>>>> > wondering what it even means to bit-pack a lot of the types,
> >>>>> > particularly if you don't have bounds on the range of values.
> >>>>> >
> >>>>> > I can see if you have a logical int8 column stored in an int32, you
> >>>>> > have bounds on the values, so bit-packing would let you pack things
> >>>>> > more densely.
> >>>>> >
> >>>>> > But if you have an int64 column, do you just store the 64 bit values
> >>>>> > back-to-back? Is that different from the plain encoding? Or do you
> >>>>> > select a bitwidth per page and store that in the page header?
> >>>>> >
> >>>>> > We also can't bit-pack types like strings at all.
> >>>>> >
> >>>>> > I guess based on that and Ryan's observation about negative numbers,
> >>>>> > it sounds like getting a quality RLE encoding isn't a trivial
> >>>>> > extension of the current encoding and needs some thought.
> >>>>> >
> >>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue <[email protected]> wrote:
> >>>>> >
> >>>>> >> There isn't anything that I know of that would prevent this from
> >>>>> >> working. I think the Java library would even read the data
> >>>>> >> successfully because it allows pages (usually dictionary-encoded
> >>>>> >> ones) to be RLE encoded.
> >>>>> >>
> >>>>> >> The main problem with this is that the RLE encoding is unaware of
> >>>>> >> negative values. Any negative number causes the entire data page to
> >>>>> >> be stored with plain encoding because the most-significant bit is
> >>>>> >> set. So there's just no benefit to doing it.
> >>>>> >>
> >>>>> >> The fact that we don't have an encoding that takes advantage of
> >>>>> >> smaller widths is why I proposed a variant of the RLE codec a while
> >>>>> >> back. Basically, it makes all numbers positive by zig-zag encoding
> >>>>> >> (moving the sign bit to the lsb) and then allows the RLE encoding
> >>>>> >> to change packing width with an extra byte. I think this would be a
> >>>>> >> good one to add for v2, but this is obviously a separate issue.
> >>>>> >>
> >>>>> >> rb
> >>>>> >>
> >>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:
> >>>>> >>
> >>>>> >> > Sorry, to clarify, in this question:
> >>>>> >> >
> >>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >>>>> >> > repetition/definition levels) ever intended for use for encoding
> >>>>> >> > data pages in the Parquet V1 format?
> >>>>> >> >
> >>>>> >> > I meant for encoding data pages that do not contain dictionary
> >>>>> >> > indices (i.e. as an alternative to PLAIN or
> >>>>> >> > PLAIN_DICTIONARY/RLE_DICTIONARY)
> >>>>> >> >
> >>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
> >>>>> >> > > We had a discussion recently [1] in which a Python
> >>>>> >> > > implementation of Parquet had used the RLE encoding type for
> >>>>> >> > > encoding the data pages for INT32 values with UINT_8 logical
> >>>>> >> > > type (non dictionary-encoded).
> >>>>> >> > >
> >>>>> >> > > In the Encodings.md document [3] in the Parquet format, it is
> >>>>> >> > > not strictly indicated that the RLE encoding is to be used for
> >>>>> >> > > definition/repetition levels and boolean, though that is all
> >>>>> >> > > that is supported in parquet-mr [4], parquet-cpp, Impala [5],
> >>>>> >> > > and other implementations.
> >>>>> >> > >
> >>>>> >> > > So questions:
> >>>>> >> > >
> >>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> >>>>> >> > > repetition/definition levels) ever intended for use for
> >>>>> >> > > encoding data pages in the Parquet V1 format?
> >>>>> >> > >
> >>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format to
> >>>>> >> > > be more explicit about the purpose and scope of this encoding?
> >>>>> >> > >
> >>>>> >> > > Thanks,
> >>>>> >> > > Wes
> >>>>> >> > >
> >>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256
> >>>>> >> > > [2]: https://github.com/dask/fastparquet
> >>>>> >> > > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
> >>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
> >>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495
> >>>>> >>
> >>>>> >> --
> >>>>> >> Ryan Blue
> >>>>> >> Software Engineer
> >>>>> >> Netflix
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
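
As an aside on Ryan's point quoted above about negative values and zig-zag encoding: the snippet below is a small Python sketch of why moving the sign bit to the lsb helps bit-packing. The function names (`zigzag32`, `unzigzag32`, `width_as_uint32`) are mine and purely illustrative. This is the standard zig-zag mapping for 32-bit integers, not code from any Parquet implementation, and it does not cover the "extra byte to change packing width" part of the proposal.

```python
def zigzag32(n):
    """Map signed 32-bit ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag32(z):
    """Inverse of zigzag32."""
    return (z >> 1) ^ -(z & 1)


def width_as_uint32(n):
    """Bits needed if n is reinterpreted as an unsigned 32-bit value."""
    return (n & 0xFFFFFFFF).bit_length()


values = [-3, 2, -1, 0]

# Stored as-is, any negative value has the msb set and needs all 32 bits.
raw_width = max(width_as_uint32(v) for v in values)

# After zig-zag, small magnitudes map to small unsigned numbers.
zz = [zigzag32(v) for v in values]
zz_width = max(z.bit_length() for z in zz)

assert [unzigzag32(z) for z in zz] == values
print(raw_width, zz_width)
```

With these sample values the packing width drops from 32 bits to 3 bits, which is the benefit a zig-zag step would buy before bit-packing.
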
