Hello Tim,

taking a brief look at what we have in parquet-cpp (which is probably very
similar to Impala), we would also have problems with runs that are longer than
2^31. While supporting arbitrarily long runs might be a really cool feature, I
think it would come at a cost: we would have to refactor a lot of code in the
current RLE implementations, and that may lead to subtle bugs. I would
therefore add a maximum run length to the spec. If there is really a need for
longer runs, then someone needs to step up and make the changes to the spec
and the implementations. As long as there is no great need, I don't think we
should pay the cost of supporting it.
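
To make the idea of a capped run length concrete, here is a minimal Python
sketch of a decoder-side check. It is not taken from parquet-cpp or Impala.
The names read_uleb128, read_run_header and MAX_RUN_LENGTH are hypothetical,
and the cap of 2^31 - 1 is only an assumed value, since the spec does not
define one today. The header layout (lowest bit selects bit-packed vs. RLE,
the remaining bits hold the count) follows Encodings.md.

```python
# Assumed cap for illustration only; Encodings.md does not specify a limit.
MAX_RUN_LENGTH = 2**31 - 1


def read_uleb128(buf, pos=0, max_bytes=5):
    """Decode one ULEB128 integer from buf, starting at pos.

    Returns (value, next_pos). Rejects truncated input and encodings longer
    than max_bytes, so the value stays bounded by construction.
    """
    value = 0
    shift = 0
    for i in range(max_bytes):
        if pos + i >= len(buf):
            raise ValueError("truncated ULEB128 value")
        byte = buf[pos + i]
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # high bit clear: this was the last byte
            return value, pos + i + 1
        shift += 7
    raise ValueError("ULEB128 value longer than %d bytes" % max_bytes)


def read_run_header(buf, pos=0):
    """Read a hybrid RLE/bit-packing run header and enforce the assumed cap.

    Returns (is_bit_packed, value_count, next_pos).
    """
    header, pos = read_uleb128(buf, pos)
    is_bit_packed = bool(header & 1)
    count = header >> 1
    if is_bit_packed:
        count *= 8  # bit-packed runs are counted in groups of 8 values
    if count > MAX_RUN_LENGTH:
        raise ValueError("run of %d values exceeds the maximum" % count)
    return is_bit_packed, count, pos
```

With this check a header such as the bytes 0x80 0x80 0x80 0x80 0x10 (an RLE
run of 2^31 values) is rejected up front instead of overflowing a signed
32-bit counter later in the decoder.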

Uwe

On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> I'm looking at an Impala bug with decoding Parquet RLE with run lengths >=
> 2^31. The bug was found by fuzz testing rather than a realistic file. I'm
> trying to determine whether the Parquet spec actually allows runs of that
> length, but Encodings.md does not seem to specify any upper bound. It
> mentions ULEB128 encoding, but that can encode arbitrarily large numbers.
> See
> https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> 
> Is there a practical limit I can assume? Should we amend the Parquet spec
> to clarify this?
> 
> The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> anyone is curious.
> 
> Thanks,
> Tim