Hello Tim,

taking a brief look at what we have in parquet-cpp (which is probably very similar to Impala), we would also have problems with runs that are longer than 2^31. While supporting arbitrarily long runs might be a really cool feature, I think it would come at a cost: we would have to refactor a lot of code in the current RLE implementations, and that may lead to subtle bugs. I would therefore add a maximum run length to the spec. If there is really a need for longer runs, then someone needs to step up and make the changes to the spec and the implementations. As long as there is no great need, I don't think we should pay the cost of supporting it.
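
To make the idea of a maximum run length concrete, here is a minimal sketch of a header decoder that rejects oversized runs. It is not the actual parquet-cpp or Impala code. The names kMaxRunLength, DecodeUleb128 and DecodeRunValueCount are hypothetical, and the cap of 2^31 - 1 is only an assumption, since the spec does not define one today. The header layout (lowest bit selects bit-packed vs. RLE, the remaining bits hold the run length or the number of groups of 8 values) follows Encodings.md.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical cap (an assumption, not part of the spec): the largest
// number of values in one run that still fits in a signed 32-bit integer.
constexpr uint64_t kMaxRunLength = (uint64_t{1} << 31) - 1;

// Decodes one ULEB128 value from `buf`, starting at `*pos` and advancing it.
// Returns std::nullopt if the input is truncated or the value needs more
// than 64 bits.
std::optional<uint64_t> DecodeUleb128(const std::vector<uint8_t>& buf,
                                      size_t* pos) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (*pos >= buf.size()) return std::nullopt;  // truncated input
    const uint8_t byte = buf[(*pos)++];
    const uint64_t bits = byte & 0x7F;
    // The tenth byte may only contribute a single bit (bit 63).
    if (shift == 63 && bits > 1) return std::nullopt;
    result |= bits << shift;
    if ((byte & 0x80) == 0) return result;  // no continuation bit: done
  }
  return std::nullopt;  // continuation bit still set after ten bytes
}

// Reads one run header of the RLE/bit-packing hybrid encoding and returns
// the number of values in the run, or std::nullopt if the header is
// malformed or the run exceeds kMaxRunLength.
std::optional<uint32_t> DecodeRunValueCount(const std::vector<uint8_t>& buf,
                                            size_t* pos) {
  const std::optional<uint64_t> header = DecodeUleb128(buf, pos);
  if (!header) return std::nullopt;
  const bool is_bit_packed = (*header & 1) != 0;
  uint64_t count = *header >> 1;
  if (is_bit_packed) {
    // Bit-packed runs are counted in groups of 8 values.
    if (count > kMaxRunLength / 8) return std::nullopt;
    count *= 8;
  }
  if (count > kMaxRunLength) return std::nullopt;
  return static_cast<uint32_t>(count);
}
```

With a check like this, a fuzzed header that claims a run of 2^31 values is rejected up front instead of overflowing a 32-bit counter further down in the decoder.
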
Uwe

On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> I'm looking at Impala bug with decoding Parquet RLE with run lengths >= 2^31.
> The bug was found by fuzz testing rather than a realistic file. I'm trying to
> determine whether the Parquet spec actually allows runs of that length, but
> Encodings.md does not seem to specify any upper bound. It mentions ULEB128
> encoding, but that can encode arbitrarily large numbers. See
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
>
> Is there a practical limit I can assume? Should we amend the Parquet spec
> to clarify this?
>
> The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> anyone is curious.
>
> Thanks,
> Tim
