I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal is that we retrospectively limit the run lengths to (2^31 - 1). That technically breaks backwards compatibility but it sounds like none of the major implementations could read or write such files anyway.
We could alternatively make it more of a suggestion that a rule and say that it's a valid Parquet file, but implementations are not required to support longer run lengths. On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <[email protected]> wrote: > It looks like there is no length check in the Parquet Java code: > https://github.com/apache/parquet-mr/blob/master/ > parquet-column/src/main/java/org/apache/parquet/column/values/rle/ > RunLengthBitPackingHybridEncoder.java#L242 > > But, that uses `writeUnsignedVarInt`, which uses an int: > https://github.com/apache/parquet-mr/blob/master/ > parquet-common/src/main/java/org/apache/parquet/bytes/ > BytesUtils.java#L219-L225 > > The run length is tracked as an int, so Java would also have a problem if > the int is 2^31 or greater. An overflow is possible when incrementing or > when shifting to add the type bit (RLE or bit-packed). Looks like we have > an effective max of 2^31-1. > > rb > > On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote: > > > Hello Tim, > > > > taking a brief look at what we have in parquet-cpp (which is probably > very > > similar to Impala), we would also have problems with runs that are longer > > than 2^31. While supporting arbitrary long runs might be a really cool > > feature, I think it will come at a cost that we would have to refactor a > > lot of code in the current RLE implementations and it may lead to subtle > > bugs. I would therefore add a maximum run length to the spec. If there is > > really a need for having longer runs, then someone needs to step up and > > make the changes to the spec and the implementations. As long as there is > > no great need, I don't think we should pay the cost of supporting it. > > > > Uwe > > > > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote: > > > I'm looking at Impala bug with decoding Parquet RLE with run lengths >= > > > 2^31. The bug was found by fuzz testing rather than a realistic file. > > > I'm > > > trying to determine whether the Parquet spec actually allows runs of > > > that > > > length, but Encodings.md does not seem to specify any upper bound. It > > > mentions ULEB128 encoding, but that can encode arbitrarily large > > > numbers. > > > See > > > https://github.com/apache/parquet-format/blob/master/ > > Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 > > > > > > Is there a practical limit I can assume? Should we amend the Parquet > spec > > > to clarify this? > > > > > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if > > > anyone is curious. > > > > > > Thanks, > > > Tim > > > > > > -- > Ryan Blue > Software Engineer > Netflix >
