+1. I think it's rare to have 2 billion values in a single page, plus encoding values too densely causes problems. 2^31-1 should work fine for a maximum. Plus, it maintains forward-compatibility with older readers that are expecting this to be an int.
rb On Tue, May 1, 2018 at 11:36 AM, Tim Armstrong <[email protected]> wrote: > I created https://issues.apache.org/jira/browse/PARQUET-1290. My proposal > is that we retrospectively limit the run lengths to (2^31 - 1). That > technically breaks backwards compatibility but it sounds like none of the > major implementations could read or write such files anyway. > > We could alternatively make it more of a suggestion that a rule and say > that it's a valid Parquet file, but implementations are not required to > support longer run lengths. > > On Tue, May 1, 2018 at 10:08 AM, Ryan Blue <[email protected]> > wrote: > >> It looks like there is no length check in the Parquet Java code: >> https://github.com/apache/parquet-mr/blob/master/parquet- >> column/src/main/java/org/apache/parquet/column/values/ >> rle/RunLengthBitPackingHybridEncoder.java#L242 >> >> But, that uses `writeUnsignedVarInt`, which uses an int: >> https://github.com/apache/parquet-mr/blob/master/parquet- >> common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L219-L225 >> >> The run length is tracked as an int, so Java would also have a problem if >> the int is 2^31 or greater. An overflow is possible when incrementing or >> when shifting to add the type bit (RLE or bit-packed). Looks like we have >> an effective max of 2^31-1. >> >> rb >> >> On Tue, May 1, 2018 at 3:40 AM, Uwe L. Korn <[email protected]> wrote: >> >> > Hello Tim, >> > >> > taking a brief look at what we have in parquet-cpp (which is probably >> very >> > similar to Impala), we would also have problems with runs that are >> longer >> > than 2^31. While supporting arbitrary long runs might be a really cool >> > feature, I think it will come at a cost that we would have to refactor a >> > lot of code in the current RLE implementations and it may lead to subtle >> > bugs. I would therefore add a maximum run length to the spec. If there >> is >> > really a need for having longer runs, then someone needs to step up and >> > make the changes to the spec and the implementations. As long as there >> is >> > no great need, I don't think we should pay the cost of supporting it. >> > >> > Uwe >> > >> > On Mon, Apr 30, 2018, at 11:18 PM, Tim Armstrong wrote: >> > > I'm looking at Impala bug with decoding Parquet RLE with run lengths >> >= >> > > 2^31. The bug was found by fuzz testing rather than a realistic file. >> > > I'm >> > > trying to determine whether the Parquet spec actually allows runs of >> > > that >> > > length, but Encodings.md does not seem to specify any upper bound. It >> > > mentions ULEB128 encoding, but that can encode arbitrarily large >> > > numbers. >> > > See >> > > https://github.com/apache/parquet-format/blob/master/ >> > Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 >> > > >> > > Is there a practical limit I can assume? Should we amend the Parquet >> spec >> > > to clarify this? >> > > >> > > The Impala bug is https://issues.apache.org/jira/browse/IMPALA-6946 >> if >> > > anyone is curious. >> > > >> > > Thanks, >> > > Tim >> > >> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > -- Ryan Blue Software Engineer Netflix
