It looks like there is no length check in the Parquet Java code: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L242
But, that uses `writeUnsignedVarInt`, which uses an int: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225

The run length is tracked as an int, so Java would also have a problem if the int is 2^31 or greater. An overflow is possible when incrementing or when shifting to add the type bit (RLE or bit-packed). Looks like we have an effective max of 2^31-1.

rb

On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote:

> Hello Tim,
>
> taking a brief look at what we have in parquet-cpp (which is probably very
> similar to Impala), we would also have problems with runs that are longer
> than 2^31. While supporting arbitrary long runs might be a really cool
> feature, I think it will come at a cost that we would have to refactor a
> lot of code in the current RLE implementations and it may lead to subtle
> bugs. I would therefore add a maximum run length to the spec. If there is
> really a need for having longer runs, then someone needs to step up and
> make the changes to the spec and the implementations. As long as there is
> no great need, I don't think we should pay the cost of supporting it.
>
> Uwe
>
> On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> > I'm looking at Impala bug with decoding Parquet RLE with run lengths >=
> > 2^31. The bug was found by fuzz testing rather than a realistic file. I'm
> > trying to determine whether the Parquet spec actually allows runs of that
> > length, but Encodings.md does not seem to specify any upper bound. It
> > mentions ULEB128 encoding, but that can encode arbitrarily large numbers.
> > See
> > https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> >
> > Is there a practical limit I can assume? Should we amend the Parquet spec
> > to clarify this?
> >
> > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> > anyone is curious.
> >
> > Thanks,
> > Tim
>

--
Ryan Blue
Software Engineer
Netflix
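
To make the overflow argument in the reply concrete, here is a minimal Java sketch. It is not the parquet-mr code: the class `RleHeaderSketch` and all of its methods are hypothetical names written for illustration, and the varint writer is only assumed to behave like an unsigned 32-bit ULEB128 encoder. It shows that a run length of 2^31-1 still round-trips when the shifted header is treated as unsigned, while incrementing past that wraps the int and produces a header for a run of length 0.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch for illustration; not the parquet-mr implementation.
class RleHeaderSketch {

    // Writes the 32 bits of value as an unsigned LEB128 varint.
    // The unsigned shift (>>>) means a negative int is treated as a large unsigned value.
    static byte[] writeUnsignedVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & 0xFFFFFF80) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value & 0x7F);
        return out.toByteArray();
    }

    // RLE run header: run length shifted left by one; a low bit of 0 marks an RLE run.
    // The shift is done in 32-bit int arithmetic, so the top bit of runLength is lost.
    static byte[] encodeRleHeader(int runLength) {
        return writeUnsignedVarInt(runLength << 1);
    }

    // Decodes the varint into a long so the full unsigned 32-bit header is visible.
    static long readUnsignedVarLong(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    // Drops the type bit to recover the run length a reader would see.
    static long decodeRunLength(byte[] header) {
        return readUnsignedVarLong(header) >>> 1;
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;   // 2^31 - 1
        int wrapped = max + 1;         // int overflow: wraps to Integer.MIN_VALUE
        System.out.println(decodeRunLength(encodeRleHeader(max)));
        System.out.println(decodeRunLength(encodeRleHeader(wrapped)));
    }
}
```

Under these assumptions the first value decodes back to 2^31-1 and the second decodes to 0, which matches the effective maximum stated in the reply.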
