Sounds good to me. On Wed, Dec 13, 2017 at 10:38 AM, Wes McKinney <[email protected]> wrote:
> Alright, so what I've taken from this thread is that we should make a > small amendment to the format documents to indicate valid uses for the > RLE encoding -- at least as far as Parquet format V1 is concerned. > These consist of: > > * Repetition and definition levels > * Indices in dictionary pages > * Boolean values > > This would help resolve the confusion that we experienced in > https://github.com/dask/fastparquet/issues/256. Does that sound right? > > Thanks > Wes > > On Mon, Dec 11, 2017 at 12:52 PM, Ryan Blue <[email protected]> wrote: > >> to do this properly, the writer should be able to specify the bitwidth > >> explicitly per-page > > > > In the encoders I proposed to do this, I added a byte to bit-packed runs > > that encodes the width: > > https://github.com/rdblue/parquet-mr/blob/encoders/ > parquet-column/src/main/java/org/apache/parquet/column/values/zigzag/ > VariableWidthRLEEncoder.java#L197 > > > > Adding a byte also allows the width to change within a page. It isn't as > > effective as patching for large values, but there are a couple benefits. > > First, you don't use the largest width for all values if you have just > one > > large one. Second, you don't have to keep an entire page worth of values > in > > memory before encoding because you can widen the bit packed runs and > write > > incrementally. > > > > rb > > > > On Fri, Dec 8, 2017 at 11:40 AM, Tim Armstrong <[email protected]> > > wrote: > >> > >> Ah I see - that's definitely not part of the format then, since it > >> requires the reader and writer to agree on the algorithm for deciding > >> bitwidth, and there's no mention of that in the format. It seems like > to do > >> this properly, the writer should be able to specify the bitwidth > explicitly > >> per-page to also handle cases where the actual values encoded do not > need > >> the full bitwidth implied by the type. > >> > >> On Fri, Dec 8, 2017 at 11:26 AM, Wes McKinney <[email protected]> > wrote: > >>> > >>> In the case where this arose, the developer had used the UINT_8 > >>> ConvertedType to imply a bit width of 8. > >>> > >>> On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]> > >>> wrote: > >>> > Good point. For Parquet Java this is always passed in. I guess this > is > >>> > using the type's maximum width? If so, I don't think this would be > >>> > readable > >>> > by other Parquet implementations because there is no place to store > the > >>> > bit > >>> > width. > >>> > > >>> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong < > [email protected]> > >>> > wrote: > >>> > > >>> >> > Using the RLE encoding will be different from the plain encoding > >>> >> > because > >>> >> you'd have the overhead bytes for runs and packed sections. We would > >>> >> still > >>> >> pack int64 values using the width, which is a required parameter. > >>> >> How would a reader determine the bit width though? I can't see > >>> >> anywhere in > >>> >> the format where the bit width is explicitly set. For the RLE level > >>> >> decoding it's implied by the max rep/def level. > >>> >> > >>> >> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> > wrote: > >>> >> > >>> >>> > But if you have a int64 column, do you just store the 64 bit > values > >>> >>> back-to-back? Is that different from the plain encoding? > >>> >>> > >>> >>> Using the RLE encoding will be different from the plan encoding > >>> >>> because > >>> >>> you'd have the overhead bytes for runs and packed sections. We > would > >>> >>> still > >>> >>> pack int64 values using the width, which is a required parameter. > >>> >>> > >>> >>> > I would suggest that we make a minor revision the format document > >>> >>> > to > >>> >>> indicate that the RLE encoding is only used for boolean values, > >>> >>> dictionary > >>> >>> indices (when using dictionary encoding, which is most of the > time), > >>> >>> and > >>> >>> the repetition and definition levels. > >>> >>> > >>> >>> Unsigned, small integers are actually a good case for using RLE > >>> >>> codecs. > >>> >>> If you can guarantee that you won't have the msb set unless the > >>> >>> number > >>> >>> really is large, then why not allow people to use them? > >>> >>> > >>> >>> rb > >>> >>> > >>> >>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong > >>> >>> <[email protected]> > >>> >>> wrote: > >>> >>> > >>> >>>> FWIW Impala doesn't support RLE-encoded booleans but it seems > like a > >>> >>>> reasonable extension. I'm not sure if other readers support that > too > >>> >>>> in > >>> >>>> practice at the moment. > >>> >>>> > >>> >>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected] > > > >>> >>>> wrote: > >>> >>>> > >>> >>>>> I think the issue is that in the library (dask/fastparquet) where > >>> >>>>> this > >>> >>>>> came up, dictionary encoding in general has not been implemented. > >>> >>>>> So > >>> >>>>> for unsigned 8-bit integer, since you can use RLE with bit width > 8 > >>> >>>>> to > >>> >>>>> encode such data, this is being used as an alternative to PLAIN > >>> >>>>> encoding. But since UINT_8 is only a logical type the annotates > >>> >>>>> INT32, > >>> >>>>> the RLE encoding as it's defined now cannot be used in general to > >>> >>>>> encode INT32. > >>> >>>>> > >>> >>>>> I would suggest that we make a minor revision the format document > >>> >>>>> to > >>> >>>>> indicate that the RLE encoding is only used for boolean values, > >>> >>>>> dictionary indices (when using dictionary encoding, which is most > >>> >>>>> of > >>> >>>>> the time), and the repetition and definition levels. > >>> >>>>> > >>> >>>>> - Wes > >>> >>>>> > >>> >>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong > >>> >>>>> <[email protected]> > >>> >>>>> wrote: > >>> >>>>> > The current RLE coding has bit-packing baked into it, so I'm > >>> >>>>> wondering what > >>> >>>>> > it even means to bit-pack a lot of the types, particularly if > you > >>> >>>>> don't > >>> >>>>> > have bounds on the range of values. > >>> >>>>> > > >>> >>>>> > I can see if you have a logic int8 column stored in an int32, > you > >>> >>>>> > have > >>> >>>>> > bounds on the values, so bit-packing would let you pack things > >>> >>>>> > more > >>> >>>>> densely > >>> >>>>> > > >>> >>>>> > But if you have a int64 column, do you just store the 64 bit > >>> >>>>> > values > >>> >>>>> > back-to-back? Is that different from the plain encoding? Or do > >>> >>>>> > you > >>> >>>>> select a > >>> >>>>> > bitwidth per page and store that in the page header? > >>> >>>>> > > >>> >>>>> > We also can't bit-pack types like strings at all. > >>> >>>>> > > >>> >>>>> > I guess based on that and Ryan's observation about negative > >>> >>>>> > numbers, > >>> >>>>> it > >>> >>>>> > sounds like getting a quality RLE encoding for isn't a trivial > >>> >>>>> extension of > >>> >>>>> > the current encoding and needs some thought. > >>> >>>>> > > >>> >>>>> > > >>> >>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue > >>> >>>>> > <[email protected]> > >>> >>>>> wrote: > >>> >>>>> > > >>> >>>>> >> There isn't anything that I know of that would prevent this > from > >>> >>>>> working. I > >>> >>>>> >> think the Java library would even read the data successfully > >>> >>>>> >> because > >>> >>>>> it > >>> >>>>> >> allows pages (usually dictionary-encoded ones) to be RLE > >>> >>>>> >> encoded. > >>> >>>>> >> > >>> >>>>> >> The main problem with this is that the RLE encoding is unaware > >>> >>>>> >> of > >>> >>>>> negative > >>> >>>>> >> values. Any negative number causes the entire data page to be > >>> >>>>> >> stored > >>> >>>>> with > >>> >>>>> >> plain encoding because the most-significant bit is set. So > >>> >>>>> >> there's > >>> >>>>> just no > >>> >>>>> >> benefit to doing it. > >>> >>>>> >> > >>> >>>>> >> The fact that we don't have an encoding that takes advantage > of > >>> >>>>> smaller > >>> >>>>> >> widths is why I proposed a variant of the RLE codec a while > >>> >>>>> >> back. > >>> >>>>> >> Basically, it makes all numbers positive by zig-zag encoding > >>> >>>>> >> (moving > >>> >>>>> the > >>> >>>>> >> sign bit to the lsb) and then allows the RLE encoding to > change > >>> >>>>> packing > >>> >>>>> >> width with an extra byte. I think this would be a good one to > >>> >>>>> >> add > >>> >>>>> for v2, > >>> >>>>> >> but this is obviously a separate issue. > >>> >>>>> >> > >>> >>>>> >> rb > >>> >>>>> >> > >>> >>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney > >>> >>>>> >> <[email protected]> > >>> >>>>> wrote: > >>> >>>>> >> > >>> >>>>> >> > Sorry, to clarify, in this question: > >>> >>>>> >> > > >>> >>>>> >> > > >>> >>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for > >>> >>>>> >> > repetition/definition levels) ever intended for use for > >>> >>>>> >> > encoding > >>> >>>>> data > >>> >>>>> >> > pages in the Parquet V1 format? > >>> >>>>> >> > > >>> >>>>> >> > I meant for encoding data pages that do not contain > dictionary > >>> >>>>> indices > >>> >>>>> >> > (i.e. as an alternative to PLAIN or > >>> >>>>> >> > PLAIN_DICTIONARY/RLE_DICTIONAR > >>> >>>>> Y) > >>> >>>>> >> > > >>> >>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney > >>> >>>>> >> > <[email protected]> > >>> >>>>> >> wrote: > >>> >>>>> >> > > We had a discussion recently [1] in which a Python > >>> >>>>> implementation of > >>> >>>>> >> > > Parquet had used the RLE encoding type for encoding the > data > >>> >>>>> pages for > >>> >>>>> >> > > INT32 values with UINT_8 logical type (non > >>> >>>>> >> > > dictionary-encoded). > >>> >>>>> >> > > > >>> >>>>> >> > > In the Encodings.md document [3] in the Parquet format, it > >>> >>>>> >> > > is not > >>> >>>>> >> > > strictly indicated that the RLE encoding is to be used for > >>> >>>>> >> > > definition/repetition levels and boolean, though that is > all > >>> >>>>> that is > >>> >>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and > >>> >>>>> >> > > other > >>> >>>>> >> > > implementations. > >>> >>>>> >> > > > >>> >>>>> >> > > So questions: > >>> >>>>> >> > > > >>> >>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for > >>> >>>>> >> > > repetition/definition levels) ever intended for use for > >>> >>>>> >> > > encoding > >>> >>>>> data > >>> >>>>> >> > > pages in the Parquet V1 format? > >>> >>>>> >> > > > >>> >>>>> >> > > 2) Whether yes or no, should we update > apache/parquet-format > >>> >>>>> >> > > to > >>> >>>>> be > >>> >>>>> >> > > more explicit about the purpose and scope of this > encoding? > >>> >>>>> >> > > > >>> >>>>> >> > > Thanks, > >>> >>>>> >> > > Wes > >>> >>>>> >> > > > >>> >>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256 > >>> >>>>> >> > > [2]: https://github.com/dask/fastparquet > >>> >>>>> >> > > [3]: https://github.com/apache/parq > >>> >>>>> uet-format/blob/master/Encodings.md > >>> >>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/ > >>> >>>>> >> > parquet-column/src/main/java/org/apache/parquet/column/ > >>> >>>>> >> Encoding.java#L115 > >>> >>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/ > >>> >>>>> >> > exec/parquet-column-readers.cc#L495 > >>> >>>>> >> > > >>> >>>>> >> > >>> >>>>> >> > >>> >>>>> >> > >>> >>>>> >> -- > >>> >>>>> >> Ryan Blue > >>> >>>>> >> Software Engineer > >>> >>>>> >> Netflix > >>> >>>>> >> > >>> >>>>> > >>> >>>> > >>> >>>> > >>> >>> > >>> >>> > >>> >>> -- > >>> >>> Ryan Blue > >>> >>> Software Engineer > >>> >>> Netflix > >>> >>> > >>> >> > >>> >> > >>> > > >>> > > >>> > -- > >>> > Ryan Blue > >>> > Software Engineer > >>> > Netflix > >> > >> > > > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > -- Ryan Blue Software Engineer Netflix
