I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal
is that we retrospectively limit the run lengths to (2^31 - 1). That
technically breaks backwards compatibility but it sounds like none of the
major implementations could read or write such files anyway.

We could alternatively make it more of a suggestion than a rule and say
that it's a valid Parquet file, but implementations are not required to
support longer run lengths.
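
To make the proposed limit concrete, here is a minimal Java sketch of a reader-side check. It is not taken from parquet-mr, Impala, or parquet-cpp: the class name RunLengthLimit and the methods decodeUleb128 and rleRunLength are hypothetical. The only thing it assumes from Encodings.md is that an RLE run header is the ULEB128-encoded value (run length << 1), with the low bit 0 marking an RLE run. The header is decoded into a long so that an oversized run length can be detected and rejected instead of silently overflowing an int.

```java
final class RunLengthLimit {
    // Proposed upper bound on a run length: 2^31 - 1.
    static final long MAX_RUN_LENGTH = Integer.MAX_VALUE;

    // Decodes one ULEB128 value from the start of buf into a long.
    // Values that would need more than 63 bits are rejected, so the
    // result is always non-negative.
    static long decodeUleb128(byte[] buf) {
        long value = 0;
        int shift = 0;
        for (byte b : buf) {
            if (shift >= 63) {
                throw new IllegalArgumentException("varint too long");
            }
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value; // high bit clear: last byte of the varint
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }

    // Extracts the run length from an RLE run header and enforces the
    // proposed limit before narrowing to int.
    static int rleRunLength(long header) {
        if ((header & 1) != 0) {
            throw new IllegalArgumentException("not an RLE run header");
        }
        long runLength = header >>> 1;
        if (runLength > MAX_RUN_LENGTH) {
            throw new IllegalArgumentException("run length exceeds 2^31 - 1");
        }
        return (int) runLength;
    }

    public static void main(String[] args) {
        // Header for a run of exactly 2^31 - 1 values: accepted.
        byte[] atLimit = {(byte) 0xFE, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x0F};
        System.out.println(rleRunLength(decodeUleb128(atLimit)));

        // Header for a run of 2^31 values: rejected.
        byte[] overLimit = {(byte) 0x80, (byte) 0x80, (byte) 0x80, (byte) 0x80, 0x10};
        try {
            rleRunLength(decodeUleb128(overLimit));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under the softer "suggestion" wording, an implementation could keep the same check and report the file as unsupported rather than invalid.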

On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <[email protected]>
wrote:

> It looks like there is no length check in the Parquet Java code:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridEncoder.java#L242
>
> But, that uses `writeUnsignedVarInt`, which uses an int:
> https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225
>
> The run length is tracked as an int, so Java would also have a problem if
> the int is 2^31 or greater. An overflow is possible when incrementing or
> when shifting to add the type bit (RLE or bit-packed). Looks like we have
> an effective max of 2^31-1.
>
> rb
>
> On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote:
>
> > Hello Tim,
> >
> > taking a brief look at what we have in parquet-cpp (which is probably very
> > similar to Impala), we would also have problems with runs that are longer
> > than 2^31. While supporting arbitrarily long runs might be a really cool
> > feature, I think it will come at a cost that we would have to refactor a
> > lot of code in the current RLE implementations and it may lead to subtle
> > bugs. I would therefore add a maximum run length to the spec. If there is
> > really a need for having longer runs, then someone needs to step up and
> > make the changes to the spec and the implementations. As long as there is
> > no great need, I don't think we should pay the cost of supporting it.
> >
> > Uwe
> >
> > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
> > > > I'm looking at an Impala bug with decoding Parquet RLE with run lengths >=
> > > > 2^31. The bug was found by fuzz testing rather than a realistic file. I'm
> > > > trying to determine whether the Parquet spec actually allows runs of that
> > > > length, but Encodings.md does not seem to specify any upper bound. It
> > > > mentions ULEB128 encoding, but that can encode arbitrarily large numbers.
> > > > See
> > > > https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> > >
> > > > Is there a practical limit I can assume? Should we amend the Parquet spec
> > > > to clarify this?
> > >
> > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 if
> > > anyone is curious.
> > >
> > > Thanks,
> > > Tim
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
