I'm looking at an Impala bug in decoding Parquet RLE with run lengths >=
2^31. The bug was found by fuzz testing rather than a realistic file. I'm
trying to determine whether the Parquet spec actually allows runs of that
length, but Encodings.md does not seem to specify any upper bound. It
mentions ULEB128 encoding, but that can encode arbitrarily large numbers.
See
https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
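
To make the concern concrete, here is a minimal Python sketch of the
arithmetic. It is not Impala's decoder; the function names are hypothetical
and chosen only for illustration. It assumes the RLE run header is the
ULEB128 encoding of (run length << 1), as Encodings.md describes, and shows
that a run length of 2^31 is a perfectly valid 5-byte header that no longer
fits in a signed 32-bit integer.

```python
INT32_MAX = 2**31 - 1


def uleb128_encode(value):
    """Encode a non-negative integer as ULEB128 (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def uleb128_decode(buf):
    """Return (value, bytes_consumed). No upper bound is enforced here."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated ULEB128 value")


def rle_run_header(run_length):
    """Hypothetical helper: header for an RLE run, with the low bit 0."""
    return uleb128_encode(run_length << 1)


def fits_int32(run_length):
    """True if the run length is representable as a signed 32-bit int."""
    return run_length <= INT32_MAX


header = rle_run_header(2**31)
indicator, consumed = uleb128_decode(header)
run_length = indicator >> 1  # drop the RLE/bit-packed flag bit
print(consumed, run_length, fits_int32(run_length))
```

A decoder that stores the run length in a signed 32-bit integer would need
an explicit check like fits_int32() before using the value, unless the spec
is amended to rule such headers out.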

Is there a practical limit I can assume? Should we amend the Parquet spec
to clarify this?

The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
anyone is curious.

Thanks,
Tim
