Oh, interesting! I'm setting it via the
ParquetWriter#withDictionaryPageSize method, and I do see the overall file
size increasing when I bump the value. I'll look into it a bit more -- it
would be helpful for some cases where the number of unique values in a
column is just over the size limit.
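
For posterity, here's a self-contained back-of-envelope sketch of the size
check. My assumptions: roughly 4 bytes of per-entry overhead plus the UTF-8
length per distinct value, and the default 1 MiB dictionary page size. This
approximates, but is not, parquet-mr's exact accounting in
DictionaryValuesWriter:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Rough estimate of how big a binary-column dictionary would grow. This is an
// approximation (assumed: ~4 bytes of overhead plus the UTF-8 length per
// distinct value), not parquet-mr's actual implementation.
public class DictSizeEstimate {
    static long estimateDictionaryBytes(Iterable<String> uniqueValues) {
        long total = 0;
        for (String v : uniqueValues) {
            total += 4 + v.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        // 50,000 distinct short numeric strings, like the second test case.
        List<String> unique = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            unique.add(Integer.toString(i));
        }
        long estimate = estimateDictionaryBytes(unique);
        long defaultMax = 1024 * 1024; // default dictionary page size, 1 MiB
        System.out.println(estimate + " bytes; fits in default page size: "
            + (estimate <= defaultMax));
    }
}
```

Under this estimate the 50k-value case comes out well under the default 1
MiB, which is part of why I suspect something other than raw dictionary size
(the effectiveness check, or the config not reaching the writer) is at play.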

- Claire

On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I'll note there is also a check for encoding effectiveness [1] that could
> come into play but I'd guess that isn't the case here.
>
> [1]
>
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>
> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> >> I'm glad I was looking at the right setting for dictionary size. I just
> >> tried it out with 10x, 50x, and even total file size, though, and still
> >> am not seeing a dictionary get created. Is it possible it's bounded by
> >> file page size or some other layout option that I need to bump as well?
> >
> >
> > Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime
> > in. If I had to guess, maybe somehow the config value isn't making it to
> > the writer (but there could also be something else at play).
> >
> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > wrote:
> >
> >> Thanks so much, Micah!
> >>
> >> > I think you are using the right setting, but maybe it is possible the
> >> > strings are still exceeding the threshold (perhaps increasing it by 50x
> >> > or more to verify)
> >>
> >>
> >> I'm glad I was looking at the right setting for dictionary size. I just
> >> tried it out with 10x, 50x, and even total file size, though, and still
> >> am not seeing a dictionary get created. Is it possible it's bounded by
> >> file page size or some other layout option that I need to bump as well?
> >>
> >> > I haven't seen this discussed during my time in the community but maybe
> >> > it was discussed in the past.  I think the main challenge here is that
> >> > pages are either dictionary encoded or not.  I'd guess to make this
> >> > practical there would need to be a new hybrid page type, which I think
> >> > might be an interesting idea but quite a bit of work.  Additionally,
> >> > one would likely need heuristics for when to potentially use the new
> >> > mode versus a complete fallback.
> >> >
> >>
> >> Got it, thanks for the explanation! It does seem like a huge amount of
> >> work.
> >>
> >>
> >> Best,
> >> Claire
> >>
> >>
> >>
> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >> > > - What's the heuristic for Parquet dictionary writing to succeed for
> >> > > a given column?
> >> >
> >> >
> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> >> >
> >> >
> >> > > - Is that heuristic configurable at all?
> >> >
> >> >
> >> > I think you are using the right setting, but maybe it is possible the
> >> > strings are still exceeding the threshold (perhaps increasing it by 50x
> >> > or more to verify)
> >> >
> >> >
> >> > > - For high-cardinality datasets, has the idea of a frequency-based
> >> > > dictionary encoding been explored? Say, if the data follows a certain
> >> > > statistical distribution, we can create a dictionary of the most
> >> > > frequent values only?
> >> >
> >> > I haven't seen this discussed during my time in the community but maybe
> >> > it was discussed in the past.  I think the main challenge here is that
> >> > pages are either dictionary encoded or not.  I'd guess to make this
> >> > practical there would need to be a new hybrid page type, which I think
> >> > might be an interesting idea but quite a bit of work.  Additionally,
> >> > one would likely need heuristics for when to potentially use the new
> >> > mode versus a complete fallback.
> >> >
> >> > Cheers,
> >> > Micah
> >> >
> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
> >> > claire.d.mcgi...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi dev@,
> >> > >
> >> > > I'm running some benchmarking on Parquet read/write performance and
> >> > > have a few questions about how dictionary encoding works under the
> >> > > hood. Let me know if there's a better channel for this :)
> >> > >
> >> > > My test case uses parquet-avro, where I'm writing a single file
> >> > > containing 5 million records. Each record has a single column, an
> >> > > Avro String field (Parquet binary field). I ran two configurations of
> >> > > the base setup: in the first case, the string field has 5,000
> >> > > possible unique values. In the second case, it has 50,000 unique
> >> > > values.
> >> > >
> >> > > In the first case (5k unique values), I used parquet-tools to
> >> > > inspect the file metadata and found that a dictionary had been
> >> > > written:
> >> > >
> >> > > % parquet-tools meta testdata-case1.parquet
> >> > > > file schema:  testdata.TestRecord
> >> > > > --------------------------------------------------------------------------------
> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> >> > > > --------------------------------------------------------------------------------
> >> > > > stringField:  BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
> >> > >
> >> > >
> >> > > But in the second case (50k unique values), parquet-tools shows that
> >> > > no dictionary gets created, and the file size is *much* bigger:
> >> > >
> >> > > % parquet-tools meta testdata-case2.parquet
> >> > > > file schema:  testdata.TestRecord
> >> > > > --------------------------------------------------------------------------------
> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> >> > > > --------------------------------------------------------------------------------
> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
> >> > >
> >> > >
> >> > > (I created a gist of my test reproduction here:
> >> > > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806)
> >> > >
> >> > > Based on this, I'm guessing there's some tip-over point after which
> >> > > Parquet will give up on writing a dictionary for a given column?
> >> > > After reading the Configuration docs
> >> > > (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md),
> >> > > I tried increasing the dictionary page size configuration 5x, with
> >> > > the same result (no dictionary created).
> >> > >
> >> > > So to summarize, my questions are:
> >> > >
> >> > > - What's the heuristic for Parquet dictionary writing to succeed for
> >> > > a given column?
> >> > > - Is that heuristic configurable at all?
> >> > > - For high-cardinality datasets, has the idea of a frequency-based
> >> > > dictionary encoding been explored? Say, if the data follows a certain
> >> > > statistical distribution, we can create a dictionary of the most
> >> > > frequent values only?
> >> > >
> >> > > Thanks for your time!
> >> > > - Claire
> >> > >
> >> >
> >>
> >
>
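
To summarize the two fallback conditions discussed in this thread, here is a
hedged, self-contained sketch. The method names echo parquet-mr's
DictionaryValuesWriter, but the bodies are a simplified reading, not the
actual implementation:

```java
// Illustrative model of the two dictionary fallback conditions discussed in
// this thread: (1) the dictionary outgrows the configured dictionary page
// size, and (2) dictionary encoding isn't actually saving space.
public class DictFallbackSketch {
    final long maxDictionaryByteSize; // cf. parquet.dictionary.page.size

    DictFallbackSketch(long maxDictionaryByteSize) {
        this.maxDictionaryByteSize = maxDictionaryByteSize;
    }

    // Condition 1: the dictionary outgrew the configured dictionary page size.
    boolean shouldFallBack(long dictionaryByteSize) {
        return dictionaryByteSize > maxDictionaryByteSize;
    }

    // Condition 2: dictionary encoding is only worthwhile if the encoded data
    // plus the dictionary itself is smaller than the raw data.
    boolean isCompressionSatisfying(long rawSize, long encodedSize,
                                    long dictionaryByteSize) {
        return (encodedSize + dictionaryByteSize) < rawSize;
    }

    public static void main(String[] args) {
        DictFallbackSketch sketch = new DictFallbackSketch(1024 * 1024);
        // A 2 MB dictionary exceeds a 1 MiB limit, so the writer falls back.
        System.out.println(sketch.shouldFallBack(2_000_000));
        // 90 bytes encoded + 20 bytes of dictionary >= 100 bytes raw: not worth it.
        System.out.println(sketch.isCompressionSatisfying(100, 90, 20));
    }
}
```

Note that either condition alone is enough to lose the dictionary, which is
why bumping only the page size config may not change the outcome.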
