Alright, so what I've taken from this thread is that we should make a small amendment to the format documents to indicate valid uses for the RLE encoding -- at least as far as Parquet format V1 is concerned. These consist of:
* Repetition and definition levels * Indices in dictionary pages * Boolean values This would help resolve the confusion that we experienced in https://github.com/dask/fastparquet/issues/256. Does that sound right? Thanks Wes On Mon, Dec 11, 2017 at 12:52 PM, Ryan Blue <[email protected]> wrote: >> to do this properly, the writer should be able to specify the bitwidth >> explicitly per-page > > In the encoders I proposed to do this, I added a byte to bit-packed runs > that encodes the width: > https://github.com/rdblue/parquet-mr/blob/encoders/parquet-column/src/main/java/org/apache/parquet/column/values/zigzag/VariableWidthRLEEncoder.java#L197 > > Adding a byte also allows the width to change within a page. It isn't as > effective as patching for large values, but there are a couple benefits. > First, you don't use the largest width for all values if you have just one > large one. Second, you don't have to keep an entire page worth of values in > memory before encoding because you can widen the bit packed runs and write > incrementally. > > rb > > On Fri, Dec 8, 2017 at 11:40 AM, Tim Armstrong <[email protected]> > wrote: >> >> Ah I see - that's definitely not part of the format then, since it >> requires the reader and writer to agree on the algorithm for deciding >> bitwidth, and there's no mention of that in the format. It seems like to do >> this properly, the writer should be able to specify the bitwidth explicitly >> per-page to also handle cases where the actual values encoded do not need >> the full bitwidth implied by the type. >> >> On Fri, Dec 8, 2017 at 11:26 AM, Wes McKinney <[email protected]> wrote: >>> >>> In the case where this arose, the developer had used the UINT_8 >>> ConvertedType to imply a bit width of 8. >>> >>> On Thu, Dec 7, 2017 at 6:53 PM, Ryan Blue <[email protected]> >>> wrote: >>> > Good point. For Parquet Java this is always passed in. I guess this is >>> > using the type's maximum width? If so, I don't think this would be >>> > readable >>> > by other Parquet implementations because there is no place to store the >>> > bit >>> > width. >>> > >>> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <[email protected]> >>> > wrote: >>> > >>> >> > Using the RLE encoding will be different from the plain encoding >>> >> > because >>> >> you'd have the overhead bytes for runs and packed sections. We would >>> >> still >>> >> pack int64 values using the width, which is a required parameter. >>> >> How would a reader determine the bit width though? I can't see >>> >> anywhere in >>> >> the format where the bit width is explicitly set. For the RLE level >>> >> decoding it's implied by the max rep/def level. >>> >> >>> >> On Thu, Dec 7, 2017 at 3:31 PM, Ryan Blue <[email protected]> wrote: >>> >> >>> >>> > But if you have a int64 column, do you just store the 64 bit values >>> >>> back-to-back? Is that different from the plain encoding? >>> >>> >>> >>> Using the RLE encoding will be different from the plan encoding >>> >>> because >>> >>> you'd have the overhead bytes for runs and packed sections. We would >>> >>> still >>> >>> pack int64 values using the width, which is a required parameter. >>> >>> >>> >>> > I would suggest that we make a minor revision the format document >>> >>> > to >>> >>> indicate that the RLE encoding is only used for boolean values, >>> >>> dictionary >>> >>> indices (when using dictionary encoding, which is most of the time), >>> >>> and >>> >>> the repetition and definition levels. >>> >>> >>> >>> Unsigned, small integers are actually a good case for using RLE >>> >>> codecs. >>> >>> If you can guarantee that you won't have the msb set unless the >>> >>> number >>> >>> really is large, then why not allow people to use them? >>> >>> >>> >>> rb >>> >>> >>> >>> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong >>> >>> <[email protected]> >>> >>> wrote: >>> >>> >>> >>>> FWIW Impala doesn't support RLE-encoded booleans but it seems like a >>> >>>> reasonable extension. I'm not sure if other readers support that too >>> >>>> in >>> >>>> practice at the moment. >>> >>>> >>> >>>> On Wed, Dec 6, 2017 at 6:19 PM, Wes McKinney <[email protected]> >>> >>>> wrote: >>> >>>> >>> >>>>> I think the issue is that in the library (dask/fastparquet) where >>> >>>>> this >>> >>>>> came up, dictionary encoding in general has not been implemented. >>> >>>>> So >>> >>>>> for unsigned 8-bit integer, since you can use RLE with bit width 8 >>> >>>>> to >>> >>>>> encode such data, this is being used as an alternative to PLAIN >>> >>>>> encoding. But since UINT_8 is only a logical type the annotates >>> >>>>> INT32, >>> >>>>> the RLE encoding as it's defined now cannot be used in general to >>> >>>>> encode INT32. >>> >>>>> >>> >>>>> I would suggest that we make a minor revision the format document >>> >>>>> to >>> >>>>> indicate that the RLE encoding is only used for boolean values, >>> >>>>> dictionary indices (when using dictionary encoding, which is most >>> >>>>> of >>> >>>>> the time), and the repetition and definition levels. >>> >>>>> >>> >>>>> - Wes >>> >>>>> >>> >>>>> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong >>> >>>>> <[email protected]> >>> >>>>> wrote: >>> >>>>> > The current RLE coding has bit-packing baked into it, so I'm >>> >>>>> wondering what >>> >>>>> > it even means to bit-pack a lot of the types, particularly if you >>> >>>>> don't >>> >>>>> > have bounds on the range of values. >>> >>>>> > >>> >>>>> > I can see if you have a logic int8 column stored in an int32, you >>> >>>>> > have >>> >>>>> > bounds on the values, so bit-packing would let you pack things >>> >>>>> > more >>> >>>>> densely >>> >>>>> > >>> >>>>> > But if you have a int64 column, do you just store the 64 bit >>> >>>>> > values >>> >>>>> > back-to-back? Is that different from the plain encoding? Or do >>> >>>>> > you >>> >>>>> select a >>> >>>>> > bitwidth per page and store that in the page header? >>> >>>>> > >>> >>>>> > We also can't bit-pack types like strings at all. >>> >>>>> > >>> >>>>> > I guess based on that and Ryan's observation about negative >>> >>>>> > numbers, >>> >>>>> it >>> >>>>> > sounds like getting a quality RLE encoding for isn't a trivial >>> >>>>> extension of >>> >>>>> > the current encoding and needs some thought. >>> >>>>> > >>> >>>>> > >>> >>>>> > On Wed, Dec 6, 2017 at 2:33 PM, Ryan Blue >>> >>>>> > <[email protected]> >>> >>>>> wrote: >>> >>>>> > >>> >>>>> >> There isn't anything that I know of that would prevent this from >>> >>>>> working. I >>> >>>>> >> think the Java library would even read the data successfully >>> >>>>> >> because >>> >>>>> it >>> >>>>> >> allows pages (usually dictionary-encoded ones) to be RLE >>> >>>>> >> encoded. >>> >>>>> >> >>> >>>>> >> The main problem with this is that the RLE encoding is unaware >>> >>>>> >> of >>> >>>>> negative >>> >>>>> >> values. Any negative number causes the entire data page to be >>> >>>>> >> stored >>> >>>>> with >>> >>>>> >> plain encoding because the most-significant bit is set. So >>> >>>>> >> there's >>> >>>>> just no >>> >>>>> >> benefit to doing it. >>> >>>>> >> >>> >>>>> >> The fact that we don't have an encoding that takes advantage of >>> >>>>> smaller >>> >>>>> >> widths is why I proposed a variant of the RLE codec a while >>> >>>>> >> back. >>> >>>>> >> Basically, it makes all numbers positive by zig-zag encoding >>> >>>>> >> (moving >>> >>>>> the >>> >>>>> >> sign bit to the lsb) and then allows the RLE encoding to change >>> >>>>> packing >>> >>>>> >> width with an extra byte. I think this would be a good one to >>> >>>>> >> add >>> >>>>> for v2, >>> >>>>> >> but this is obviously a separate issue. >>> >>>>> >> >>> >>>>> >> rb >>> >>>>> >> >>> >>>>> >> On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney >>> >>>>> >> <[email protected]> >>> >>>>> wrote: >>> >>>>> >> >>> >>>>> >> > Sorry, to clarify, in this question: >>> >>>>> >> > >>> >>>>> >> > >>> >>>>> >> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for >>> >>>>> >> > repetition/definition levels) ever intended for use for >>> >>>>> >> > encoding >>> >>>>> data >>> >>>>> >> > pages in the Parquet V1 format? >>> >>>>> >> > >>> >>>>> >> > I meant for encoding data pages that do not contain dictionary >>> >>>>> indices >>> >>>>> >> > (i.e. as an alternative to PLAIN or >>> >>>>> >> > PLAIN_DICTIONARY/RLE_DICTIONAR >>> >>>>> Y) >>> >>>>> >> > >>> >>>>> >> > On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney >>> >>>>> >> > <[email protected]> >>> >>>>> >> wrote: >>> >>>>> >> > > We had a discussion recently [1] in which a Python >>> >>>>> implementation of >>> >>>>> >> > > Parquet had used the RLE encoding type for encoding the data >>> >>>>> pages for >>> >>>>> >> > > INT32 values with UINT_8 logical type (non >>> >>>>> >> > > dictionary-encoded). >>> >>>>> >> > > >>> >>>>> >> > > In the Encodings.md document [3] in the Parquet format, it >>> >>>>> >> > > is not >>> >>>>> >> > > strictly indicated that the RLE encoding is to be used for >>> >>>>> >> > > definition/repetition levels and boolean, though that is all >>> >>>>> that is >>> >>>>> >> > > supported in parquet-mr [4], parquet-cpp, Impala [5], and >>> >>>>> >> > > other >>> >>>>> >> > > implementations. >>> >>>>> >> > > >>> >>>>> >> > > So questions: >>> >>>>> >> > > >>> >>>>> >> > > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for >>> >>>>> >> > > repetition/definition levels) ever intended for use for >>> >>>>> >> > > encoding >>> >>>>> data >>> >>>>> >> > > pages in the Parquet V1 format? >>> >>>>> >> > > >>> >>>>> >> > > 2) Whether yes or no, should we update apache/parquet-format >>> >>>>> >> > > to >>> >>>>> be >>> >>>>> >> > > more explicit about the purpose and scope of this encoding? >>> >>>>> >> > > >>> >>>>> >> > > Thanks, >>> >>>>> >> > > Wes >>> >>>>> >> > > >>> >>>>> >> > > [1]: https://github.com/dask/fastparquet/issues/256 >>> >>>>> >> > > [2]: https://github.com/dask/fastparquet >>> >>>>> >> > > [3]: https://github.com/apache/parq >>> >>>>> uet-format/blob/master/Encodings.md >>> >>>>> >> > > [4]: https://github.com/apache/parquet-mr/blob/master/ >>> >>>>> >> > parquet-column/src/main/java/org/apache/parquet/column/ >>> >>>>> >> Encoding.java#L115 >>> >>>>> >> > > [5]: https://github.com/apache/impala/blob/master/be/src/ >>> >>>>> >> > exec/parquet-column-readers.cc#L495 >>> >>>>> >> > >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> -- >>> >>>>> >> Ryan Blue >>> >>>>> >> Software Engineer >>> >>>>> >> Netflix >>> >>>>> >> >>> >>>>> >>> >>>> >>> >>>> >>> >>> >>> >>> >>> >>> -- >>> >>> Ryan Blue >>> >>> Software Engineer >>> >>> Netflix >>> >>> >>> >> >>> >> >>> > >>> > >>> > -- >>> > Ryan Blue >>> > Software Engineer >>> > Netflix >> >> > > > > -- > Ryan Blue > Software Engineer > Netflix
