Thanks so much, Micah!

> I think you are using the right setting, but maybe it is possible the
> strings are still exceeding the threshold (perhaps increasing it by 50x or
> more to verify).

I'm glad I was looking at the right setting for dictionary size. I just
tried it out with 10x, 50x, and even the total file size, though, and I'm
still not seeing a dictionary get created. Is it possible it's bounded by
the file page size or some other layout option that I need to bump as well?
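For what it's worth, my read of the DictionaryValuesWriter code you linked
below is that the fallback check boils down to something like this (a
paraphrased sketch from my reading, not the exact source, and the names are
approximate):

    // Paraphrased sketch of the check at DictionaryValuesWriter#shouldFallBack.
    // My understanding: once the raw bytes of the accumulated unique values
    // cross the configured dictionary page size (or the entry count overflows),
    // the writer falls back to plain encoding for that column chunk.
    boolean shouldFallBack(long dictionaryByteSize, long maxDictionaryByteSize,
                           int dictionarySize, int maxDictionaryEntries) {
      return dictionaryByteSize > maxDictionaryByteSize
          || dictionarySize > maxDictionaryEntries;
    }

If that reading is right, then parquet.dictionary.page.size should be the
main threshold in play, which is why I'm surprised that raising it didn't
change anything.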
> I haven't seen any discussion during my time in the community, but maybe
> it was discussed in the past. I think the main challenge here is that
> pages are either dictionary encoded or not. I'd guess that to make this
> practical there would need to be a new hybrid page type, which I think
> might be an interesting idea but quite a bit of work. Additionally, one
> would likely need heuristics for when to use the new mode versus a
> complete fallback.

Got it, thanks for the explanation! It does seem like a huge amount of
work.
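To make the idea concrete for anyone who finds this thread later, the
selection step I was imagining is roughly the sketch below. This is purely
illustrative on my part (nothing like it exists in parquet-mr today), and
it ignores the hard part you pointed out, i.e. how to encode the
non-dictionary values on the same page:

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;

    class FrequencyDictionarySketch {
      // Keep only the k most frequent values as dictionary entries; all
      // remaining values would need a plain-encoded fallback on the same
      // page, which is the "hybrid page type" problem described above.
      static Set<String> topKValues(List<String> column, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : column) {
          counts.merge(v, 1L, Long::sum);
        }
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
      }
    }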
Best,
Claire

On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> > - What's the heuristic for Parquet dictionary writing to succeed for a
> > given column?
>
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>
> > - Is that heuristic configurable at all?
>
> I think you are using the right setting, but maybe it is possible the
> strings are still exceeding the threshold (perhaps increasing it by 50x or
> more to verify).
>
> > - For high-cardinality datasets, has the idea of a frequency-based
> > dictionary encoding been explored? Say, if the data follows a certain
> > statistical distribution, we can create a dictionary of the most
> > frequent values only?
>
> I haven't seen any discussion during my time in the community, but maybe
> it was discussed in the past. I think the main challenge here is that
> pages are either dictionary encoded or not. I'd guess that to make this
> practical there would need to be a new hybrid page type, which I think
> might be an interesting idea but quite a bit of work. Additionally, one
> would likely need heuristics for when to use the new mode versus a
> complete fallback.
>
> Cheers,
> Micah
>
> On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
> claire.d.mcgi...@gmail.com> wrote:
>
> > Hi dev@,
> >
> > I'm running some benchmarking on Parquet read/write performance and
> > have a few questions about how dictionary encoding works under the
> > hood. Let me know if there's a better channel for this :)
> >
> > My test case uses parquet-avro, where I'm writing a single file
> > containing 5 million records. Each record has a single column, an Avro
> > String field (Parquet binary field). I ran two configurations of the
> > base setup: in the first case, the string field has 5,000 possible
> > unique values. In the second case, it has 50,000 unique values.
> >
> > In the first case (5k unique values), I used parquet-tools to inspect
> > the file metadata and found that a dictionary had been written:
> >
> > % parquet-tools meta testdata-case1.parquet
> > > file schema: testdata.TestRecord
> > > --------------------------------------------------------------------------------
> > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> > > row group 1: RC:5000001 TS:18262874 OFFSET:4
> > > --------------------------------------------------------------------------------
> > > stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999,
> > > num_nulls: 0]
> >
> > But in the second case (50k unique values), parquet-tools shows that no
> > dictionary gets created, and the file size is *much* bigger:
> >
> > % parquet-tools meta testdata-case2.parquet
> > > file schema: testdata.TestRecord
> > > --------------------------------------------------------------------------------
> > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> > > row group 1: RC:5000001 TS:18262874 OFFSET:4
> > > --------------------------------------------------------------------------------
> > > stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
> >
> > (I created a gist of my test reproduction here
> > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
> >
> > Based on this, I'm guessing there's some tip-over point after which
> > Parquet will give up on writing a dictionary for a given column? After
> > reading the Configuration docs
> > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
> > I tried increasing the dictionary page size configuration 5x, with the
> > same result (no dictionary created).
> >
> > So to summarize, my questions are:
> >
> > - What's the heuristic for Parquet dictionary writing to succeed for a
> > given column?
> > - Is that heuristic configurable at all?
> > - For high-cardinality datasets, has the idea of a frequency-based
> > dictionary encoding been explored? Say, if the data follows a certain
> > statistical distribution, we can create a dictionary of the most
> > frequent values only?
> >
> > Thanks for your time!
> > - Claire
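P.S. For convenience, here is a simplified inline version of the setup
described above (hand-written for this email rather than copied from the
gist, so the gist remains the source of truth; paths and sizes are
placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class DictionaryRepro {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"TestRecord\",\"namespace\":\"testdata\","
                + "\"fields\":[{\"name\":\"stringField\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case2.parquet"))
                .withSchema(schema)
                .withDictionaryEncoding(true)
                // default is 1 MiB (parquet.dictionary.page.size); bumped 50x here
                .withDictionaryPageSize(50 * 1024 * 1024)
                .build()) {
          // 5 million records, 50k unique values (case 2)
          for (int i = 0; i < 5_000_000; i++) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("stringField", Integer.toString(i % 50_000));
            writer.write(record);
          }
        }
      }
    }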