+1. I think it's rare to have 2 billion values in a single page, plus
encoding values too densely causes problems. 2^31-1 should work fine for a
maximum. Plus, it maintains forward-compatibility with older readers that
are expecting this to be an int.

rb

On Tue, May 1, 2018 at 11:36 AM, Tim Armstrong <[email protected]>
wrote:

> I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal
> is that we retrospectively limit the run lengths to (2^31 - 1). That
> technically breaks backwards compatibility but it sounds like none of the
> major implementations could read or write such files anyway.
>
> We could alternatively make it more of a suggestion that a rule and say
> that it's a valid Parquet file, but implementations are not required to
> support longer run lengths.
>
> On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <[email protected]>
> wrote:
>
>> It looks like there is no length check in the Parquet Java code:
>> https://github.com/apache/parquet-mr/blob/master/parquet-
>> column/src/main/java/org/apache/parquet/column/values/
>> rle/RunLengthBitPackingHybridEncoder.java#L242
>>
>> But, that uses `writeUnsignedVarInt`, which uses an int:
>> https://github.com/apache/parquet-mr/blob/master/parquet-
>> common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225
>>
>> The run length is tracked as an int, so Java would also have a problem if
>> the int is 2^31 or greater. An overflow is possible when incrementing or
>> when shifting to add the type bit (RLE or bit-packed). Looks like we have
>> an effective max of 2^31-1.
>>
>> rb
>>
>> On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote:
>>
>> > Hello Tim,
>> >
>> > taking a brief look at what we have in parquet-cpp (which is probably
>> very
>> > similar to Impala), we would also have problems with runs that are
>> longer
>> > than 2^31. While supporting arbitrary long runs might be a really cool
>> > feature, I think it will come at a cost that we would have to refactor a
>> > lot of code in the current RLE implementations and it may lead to subtle
>> > bugs. I would therefore add a maximum run length to the spec. If there
>> is
>> > really a need for having longer runs, then someone needs to step up and
>> > make the changes to the spec and the implementations. As long as there
>> is
>> > no great need, I don't think we should pay the cost of supporting it.
>> >
>> > Uwe
>> >
>> > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote:
>> > > I'm looking at Impala bug with decoding Parquet RLE with run lengths
>> >=
>> > > 2^31. The bug was found by fuzz testing rather than a realistic file.
>> > > I'm
>> > > trying to determine whether the Parquet spec actually allows runs of
>> > > that
>> > > length, but Encodings.md does not seem to specify any upper bound. It
>> > > mentions ULEB128 encoding, but that can encode arbitrarily large
>> > > numbers.
>> > > See
>> > > https://github.com/apache/parquet-format/blob/master/
>> > Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
>> > >
>> > > Is there a practical limit I can assume? Should we amend the Parquet
>> spec
>> > > to clarify this?
>> > >
>> > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946
>> if
>> > > anyone is curious.
>> > >
>> > > Thanks,
>> > > Tim
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to