There isn't anything that I know of that would prevent this from working. I think the Java library would even read the data successfully because it allows pages (usually dictionary-encoded ones) to be RLE encoded.

The main problem with this is that the RLE encoding is unaware of negative values. Any negative number causes the entire data page to be stored with plain encoding because the most-significant bit is set. So there's just no benefit to doing it.

The fact that we don't have an encoding that takes advantage of smaller widths is why I proposed a variant of the RLE codec a while back. Basically, it makes all numbers positive by zig-zag encoding (moving the sign bit to the lsb) and then allows the RLE encoding to change packing width with an extra byte. I think this would be a good one to add for v2, but this is obviously a separate issue.

rb

On Wed, Dec 6, 2017 at 1:58 PM, Wes McKinney <[email protected]> wrote:

> Sorry, to clarify, in this question:
>
> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> > repetition/definition levels) ever intended for use for encoding data
> > pages in the Parquet V1 format?
>
> I meant for encoding data pages that do not contain dictionary indices
> (i.e. as an alternative to PLAIN or PLAIN_DICTIONARY/RLE_DICTIONARY)
>
> On Wed, Dec 6, 2017 at 4:53 PM, Wes McKinney <[email protected]> wrote:
> > We had a discussion recently [1] in which a Python implementation of
> > Parquet had used the RLE encoding type for encoding the data pages for
> > INT32 values with UINT_8 logical type (non dictionary-encoded).
> >
> > In the Encodings.md document [3] in the Parquet format, it is not
> > strictly indicated that the RLE encoding is to be used for
> > definition/repetition levels and boolean, though that is all that is
> > supported in parquet-mr [4], parquet-cpp, Impala [5], and other
> > implementations.
> >
> > So questions:
> >
> > 1) Was RLE (the Hybrid-bitpacked RLE encoder used for
> > repetition/definition levels) ever intended for use for encoding data
> > pages in the Parquet V1 format?
> >
> > 2) Whether yes or no, should we update apache/parquet-format to be
> > more explicit about the purpose and scope of this encoding?
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/dask/fastparquet/issues/256
> > [2]: https://github.com/dask/fastparquet
> > [3]: https://github.com/apache/parquet-format/blob/master/Encodings.md
> > [4]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L115
> > [5]: https://github.com/apache/impala/blob/master/be/src/exec/parquet-column-readers.cc#L495

--
Ryan Blue
Software Engineer
Netflix
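
For anyone who wants to see the zig-zag point from the reply concretely, here is a rough Python sketch. It is only an illustration under stated assumptions, not the proposed codec and not code from parquet-mr or fastparquet: the helper names (`zigzag_encode`, `zigzag_decode`, `packed_width`) are made up for this example, values are assumed to be 32-bit signed integers, and the "extra byte" that would let the RLE encoding change packing width is not modeled at all.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map a signed integer to an unsigned one, moving the sign into the lsb.

    Hypothetical helper: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    """
    mask = (1 << bits) - 1
    # (n >> (bits - 1)) is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> (bits - 1))) & mask


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode for unsigned input z."""
    return (z >> 1) ^ -(z & 1)


def packed_width(values, bits: int = 32) -> int:
    """Bits per value needed to bit-pack the values as unsigned words.

    Negative values are viewed as two's-complement words of `bits` bits,
    so a single negative number forces the full width.
    """
    mask = (1 << bits) - 1
    return max((v & mask).bit_length() for v in values)


values = [3, -1, 2, 0, -2]
encoded = [zigzag_encode(v) for v in values]

print(packed_width(values))   # width without zig-zag
print(packed_width(encoded))  # width after zig-zag
print([zigzag_decode(z) for z in encoded] == values)
```

With small values of mixed sign, the raw two's-complement view needs the full 32 bits per value because of the set sign bit, while the zig-zag view needs only a few bits, which is the saving a narrower packing width would exploit.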
