Re: Parquet dictionary size limits?

Micah Kornfield Fri, 15 Sep 2023 09:54:12 -0700

I'll note there is also a check for encoding effectiveness [1] that could
come into play but I'd guess that isn't the case here.


[1]
https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
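
As a rough illustration of how the dictionary size limit and the encoding effectiveness check might interact, here is a simplified Python model. It is a sketch based on one reading of the linked lines, not parquet-mr's API: the class, the function names, and the 1 MiB default are all assumptions made for the example, so check them against the version you are running.

```python
from dataclasses import dataclass

# Assumed default for the dictionary page size limit (1 MiB).
DEFAULT_MAX_DICTIONARY_BYTES = 1024 * 1024


@dataclass
class DictionaryStats:
    """Hypothetical summary of what a dictionary writer has buffered so far."""
    dictionary_bytes: int  # bytes needed to store the distinct values
    raw_bytes: int         # size of the buffered values if written as PLAIN
    encoded_bytes: int     # size of the buffered dictionary indices


def should_fall_back(stats, max_dictionary_bytes=DEFAULT_MAX_DICTIONARY_BYTES):
    """Size check: give up once the dictionary outgrows the configured limit."""
    return stats.dictionary_bytes > max_dictionary_bytes


def is_compression_satisfying(stats):
    """Effectiveness check: indices plus dictionary must beat the raw size."""
    return stats.encoded_bytes + stats.dictionary_bytes < stats.raw_bytes


def keeps_dictionary(stats, max_dictionary_bytes=DEFAULT_MAX_DICTIONARY_BYTES):
    """Dictionary encoding survives only if both checks pass."""
    return (not should_fall_back(stats, max_dictionary_bytes)
            and is_compression_satisfying(stats))
```

In this model, raising the size limit only helps when the size check is the one that fails. If the indices plus the dictionary are not smaller than the raw values, the writer would still fall back no matter how large the limit is.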

On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <[email protected]>
wrote:

>> I'm glad I was looking at the right setting for dictionary size. I just
>> tried it out with 10x, 50x, and even total file size, though, and still am
>> not seeing a dictionary get created. Is it possible it's bounded by file
>> page size or some other layout option that I need to bump as well?
>
>
> Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime
> in.  If I had to guess, maybe somehow the config value isn't making it to
> the writer (but there could also be something else at play).
>
> On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <[email protected]>
> wrote:
>
>> Thanks so much, Micah!
>>
>> > I think you are using the right setting, but maybe it is possible the
>> > strings are still exceeding the threshold (perhaps increasing it by 50x
>> > or more to verify)
>>
>>
>> I'm glad I was looking at the right setting for dictionary size. I just
>> tried it out with 10x, 50x, and even total file size, though, and still am
>> not seeing a dictionary get created. Is it possible it's bounded by file
>> page size or some other layout option that I need to bump as well?
>>
>> > I haven't seen any discussion during my time in the community but maybe it
>> > was discussed in the past.  I think the main challenge here is that pages
>> > are either dictionary encoded or not.  I'd guess to make this practical
>> > there would need to be a new hybrid page type, which I think might be an
>> > interesting idea but quite a bit of work.  Additionally, one would likely
>> > need heuristics for when to potentially use the new mode versus a complete
>> > fallback.
>> >
>>
>> Got it, thanks for the explanation! It does seem like a huge amount of
>> work.
>>
>>
>> Best,
>> Claire
>>
>>
>>
>> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>> > >
>> > > - What's the heuristic for Parquet dictionary writing to succeed for a
>> > > given column?
>> >
>> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>> >
>> >
>> > > - Is that heuristic configurable at all?
>> >
>> >
>> > I think you are using the right setting, but maybe it is possible the
>> > strings are still exceeding the threshold (perhaps increasing it by 50x
>> > or more to verify)
>> >
>> >
>> > > - For high-cardinality datasets, has the idea of a frequency-based
>> > > dictionary encoding been explored? Say, if the data follows a certain
>> > > statistical distribution, we can create a dictionary of the most
>> > > frequent values only?
>> >
>> > I haven't seen any discussion during my time in the community but maybe it
>> > was discussed in the past.  I think the main challenge here is that pages
>> > are either dictionary encoded or not.  I'd guess to make this practical
>> > there would need to be a new hybrid page type, which I think might be an
>> > interesting idea but quite a bit of work.  Additionally, one would likely
>> > need heuristics for when to potentially use the new mode versus a complete
>> > fallback.
>> >
>> > Cheers,
>> > Micah
>> >
>> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <[email protected]>
>> > wrote:
>> >
>> > > Hi dev@,
>> > >
>> > > I'm running some benchmarking on Parquet read/write performance and have a
>> > > few questions about how dictionary encoding works under the hood. Let me
>> > > know if there's a better channel for this :)
>> > >
>> > > My test case uses parquet-avro, where I'm writing a single file containing
>> > > 5 million records. Each record has a single column, an Avro String field
>> > > (Parquet binary field). I ran two configurations of the base setup: in the
>> > > first case, the string field has 5,000 possible unique values. In the
>> > > second case, it has 50,000 unique values.
>> > >
>> > > In the first case (5k unique values), I used parquet-tools to inspect the
>> > > file metadata and found that a dictionary had been written:
>> > >
>> > > % parquet-tools meta testdata-case1.parquet
>> > > > file schema:  testdata.TestRecord
>> > > > --------------------------------------------------------------------------------
>> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > > --------------------------------------------------------------------------------
>> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
>> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>> > >
>> > >
>> > > But in the second case (50k unique values), parquet-tools shows that no
>> > > dictionary gets created, and the file size is *much* bigger:
>> > >
>> > > % parquet-tools meta testdata-case2.parquet
>> > > > file schema:  testdata.TestRecord
>> > > > --------------------------------------------------------------------------------
>> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > > --------------------------------------------------------------------------------
>> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
>> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>> > >
>> > >
>> > > (I created a gist of my test reproduction here
>> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>> > >
>> > > Based on this, I'm guessing there's some tip-over point after which Parquet
>> > > will give up on writing a dictionary for a given column? After reading
>> > > the Configuration docs
>> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
>> > > I tried increasing the dictionary page size configuration 5x, with the same
>> > > result (no dictionary created).
>> > >
>> > > So to summarize, my questions are:
>> > >
>> > > - What's the heuristic for Parquet dictionary writing to succeed for a
>> > > given column?
>> > > - Is that heuristic configurable at all?
>> > > - For high-cardinality datasets, has the idea of a frequency-based
>> > > dictionary encoding been explored? Say, if the data follows a certain
>> > > statistical distribution, we can create a dictionary of the most
>> > > frequent values only?
>> > >
>> > > Thanks for your time!
>> > > - Claire
>> > >
>> >
>>
>
