It looks like there is no length check in the Parquet Java code:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L242

But that uses `writeUnsignedVarInt`, which takes an int:
https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225

The run length is tracked as an int, so Java would also have a problem if
the run length reached 2^31 or more. An overflow is possible when
incrementing the count or when shifting it left to add the type bit (RLE
or bit-packed). It looks like we have an effective max of 2^31-1.
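
To make the overflow concrete, here is a rough sketch rather than the actual
parquet-mr code. It assumes the RLE run header is the run length shifted left
by one bit, written as an unsigned varint from a 32-bit int. The class name
`RleHeaderSketch` and its helpers are made up for illustration; only the shape
of `writeUnsignedVarInt` follows the linked BytesUtils method.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of parquet-mr.
final class RleHeaderSketch {

    // Writes a 32-bit int as an unsigned varint (ULEB128), treating the
    // bits as unsigned by using the unsigned shift >>>.
    static byte[] writeUnsignedVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & 0xFFFFFF80) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value & 0x7F);
        return out.toByteArray();
    }

    // Decodes into a long so the value a reader would recover is visible
    // without being truncated to 32 bits.
    static long readUnsignedVarInt(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    // Assumed header layout: run length in the upper bits, low bit 0 for RLE.
    static int rleHeader(int runLength) {
        return runLength << 1;
    }

    public static void main(String[] args) {
        // Largest run that still round-trips: 2^31-1.
        int maxRun = Integer.MAX_VALUE;
        long decoded = readUnsignedVarInt(writeUnsignedVarInt(rleHeader(maxRun)));
        System.out.println("max run decodes to length " + (decoded >>> 1));

        // One more increment wraps the counter to a negative value, and the
        // shift then discards the top bit, so the header becomes 0.
        int wrapped = maxRun + 1;
        System.out.println("wrapped counter = " + wrapped);
        System.out.println("header after wrap = " + rleHeader(wrapped));
    }
}
```

Under these assumptions a run of 2^31-1 still round-trips, because the
shifted header fits in 32 unsigned bits, while the next increment silently
turns into a header of 0.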

rb

On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote:

> Hello Tim,
>
> taking a brief look at what we have in parquet-cpp (which is probably very
> similar to Impala), we would also have problems with runs that are longer
> than 2^31. While supporting arbitrarily long runs might be a really cool
> feature, I think it will come at a cost that we would have to refactor a
> lot of code in the current RLE implementations and it may lead to subtle
> bugs. I would therefore add a maximum run length to the spec. If there is
> really a need for having longer runs, then someone needs to step up and
> make the changes to the spec and the implementations. As long as there is
> no great need, I don't think we should pay the cost of supporting it.
>
> Uwe
>
> On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> > I'm looking at an Impala bug with decoding Parquet RLE with run lengths
> > >= 2^31. The bug was found by fuzz testing rather than a realistic file.
> > I'm trying to determine whether the Parquet spec actually allows runs of
> > that length, but Encodings.md does not seem to specify any upper bound.
> > It mentions ULEB128 encoding, but that can encode arbitrarily large
> > numbers. See
> > https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> >
> > Is there a practical limit I can assume? Should we amend the Parquet spec
> > to clarify this?
> >
> > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> > anyone is curious.
> >
> > Thanks,
> > Tim
>



-- 
Ryan Blue
Software Engineer
Netflix
