Claire, thank you for your research and examples on this topic; I've
learned a lot.  My hunch is that your change would be a good one, but I'm
not an expert (and more to the point, not a committer).  I'm looking
forward to learning more as this discussion continues.

Thank you again, Aaron

On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> I created a quick branch
> <
> https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1
> >
> to reproduce what I'm seeing -- the test shows that an int column with
> cardinality 100 successfully results in a dict encoding, but an int
> column with cardinality 10,000 falls back and doesn't create a dict
> encoding. This seems like a low threshold given the 1MB dictionary page
> size, so I just wanted to check whether this is expected or not :)
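
[Editor's note: to build intuition for the behavior described above, here is a toy model of a dictionary "compression satisfying" decision. This is NOT parquet-mr code; the sizing formulas are simplified assumptions for a 4-byte int column. It illustrates how the comparison can fail for high cardinality even when the dictionary itself sits far below the 1MB dictionary page limit.]

```java
// Toy model (not parquet-mr code) of the dictionary fallback decision:
// keep the dictionary only while bit-packed ids plus the dictionary are
// smaller than plain encoding.
public class DictFallbackModel {

    // Bits needed to index `cardinality` distinct values.
    static int bitWidth(long cardinality) {
        if (cardinality <= 1) return 1;
        return 64 - Long.numberOfLeadingZeros(cardinality - 1);
    }

    // Approximate dictionary-encoded size: bit-packed ids plus the
    // dictionary itself (4 bytes per distinct int value).
    static long dictEncodedBytes(long valueCount, long cardinality) {
        long idBytes = (valueCount * bitWidth(cardinality) + 7) / 8;
        long dictionaryBytes = cardinality * 4L;
        return idBytes + dictionaryBytes;
    }

    // In the spirit of isCompressionSatisfying: the dictionary encoding
    // must actually beat plain encoding (4 bytes per value).
    static boolean compressionSatisfying(long valueCount, long cardinality) {
        return dictEncodedBytes(valueCount, cardinality) < valueCount * 4L;
    }

    public static void main(String[] args) {
        // Suppose the check runs once ~16k int values (64KiB raw) are buffered.
        // A 100-entry dictionary easily pays for itself:
        System.out.println(compressionSatisfying(16_384, 100));    // true
        // A 10,000-entry dictionary (~40KB, well under a 1MB page limit)
        // does not yet, so this model would fall back:
        System.out.println(compressionSatisfying(16_384, 10_000)); // false
    }
}
```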
>
> Best,
> Claire
>
> On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> wrote:
>
> > Hi, just wanted to follow up on this!
> >
> > I ran a debugger to find out why my column wasn't ending up with a
> > dictionary encoding, and it turns out that even though
> > DictionaryValuesWriter#shouldFallback()
> > <
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > >
> > always returned false (dictionaryByteSize was always less than my
> > configured page size), DictionaryValuesWriter#isCompressionSatisfying
> > <
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > >
> > was what was causing Parquet to switch
> > <
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75
> > >
> > back to the fallback, non-dict writer.
> >
> > From what I can tell, this check compares the total byte size of
> > *every* element with the byte size of each *distinct* element as a
> > kind of proxy for encoding efficiency... however, it seems strange
> > that this check can cause the writer to fall back even if the total
> > encoded dict size is far below the configured dictionary page size.
> > Out of curiosity, I modified
> > DictionaryValuesWriter#isCompressionSatisfying
> > <
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > >
> > to also check whether the total byte size was less than the dictionary
> > max size, re-ran my Parquet write with a local snapshot, and my file
> > size dropped 50%.
> >
> > Best,
> > Claire
> >
> > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > wrote:
> >
> >> Oh, interesting! I'm setting it via the
> >> ParquetWriter#withDictionaryPageSize method, and I do see the overall
> >> file size increasing when I bump the value. I'll look into it a bit
> >> more -- it would be helpful for some cases where the number of unique
> >> values in a column is just over the size limit.
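
[Editor's note: for reference, a sketch of the writer setup under discussion, using the parquet-avro builder API. The path, schema, and 5MiB value are placeholders for whatever the benchmark actually uses.]

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical writer configuration (placeholder path/schema):
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new Path("testdata.parquet"))
    .withSchema(schema)
    .withDictionaryEncoding(true)
    .withDictionaryPageSize(5 * 1024 * 1024) // bumped vs. the 1MiB default
    .build();
```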
> >>
> >> - Claire
> >>
> >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >>> I'll note there is also a check for encoding effectiveness [1] that
> >>> could come into play, but I'd guess that isn't the case here.
> >>>
> >>> [1]
> >>>
> >>>
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
> >>>
> >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>> >> I'm glad I was looking at the right setting for dictionary size. I
> >>> >> just tried it out with 10x, 50x, and even the total file size,
> >>> >> though, and still am not seeing a dictionary get created. Is it
> >>> >> possible it's bounded by file page size or some other layout option
> >>> >> that I need to bump as well?
> >>> >
> >>> >
> >>> > Sorry, I'm less familiar with parquet-mr; hopefully someone else can
> >>> > chime in.  If I had to guess, maybe somehow the config value isn't
> >>> > making it to the writer (but there could also be something else at
> >>> > play).
> >>> >
> >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Thanks so much, Micah!
> >>> >>
> >>> >> > I think you are using the right setting, but maybe it is possible
> >>> >> > the strings are still exceeding the threshold (perhaps increase it
> >>> >> > by 50x or more to verify).
> >>> >>
> >>> >>
> >>> >> I'm glad I was looking at the right setting for dictionary size. I
> >>> >> just tried it out with 10x, 50x, and even the total file size,
> >>> >> though, and still am not seeing a dictionary get created. Is it
> >>> >> possible it's bounded by file page size or some other layout option
> >>> >> that I need to bump as well?
> >>> >>
> >>> >> > I haven't seen this discussed during my time in the community, but
> >>> >> > maybe it was discussed in the past.  I think the main challenge
> >>> >> > here is that pages are either dictionary encoded or not.  I'd
> >>> >> > guess that to make this practical there would need to be a new
> >>> >> > hybrid page type, which I think might be an interesting idea but
> >>> >> > quite a bit of work.  Additionally, one would likely need
> >>> >> > heuristics for when to potentially use the new mode versus a
> >>> >> > complete fallback.
> >>> >> >
> >>> >>
> >>> >> Got it, thanks for the explanation! It does seem like a huge amount
> >>> >> of work.
> >>> >>
> >>> >>
> >>> >> Best,
> >>> >> Claire
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >> > >
> >>> >> > > - What's the heuristic for Parquet dictionary writing to
> >>> >> > > succeed for a given column?
> >>> >> >
> >>> >> >
> >>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> >>> >> >
> >>> >> >
> >>> >> > > - Is that heuristic configurable at all?
> >>> >> >
> >>> >> >
> >>> >> > I think you are using the right setting, but maybe it is possible
> >>> >> > the strings are still exceeding the threshold (perhaps increase it
> >>> >> > by 50x or more to verify).
> >>> >> >
> >>> >> >
> >>> >> > > - For high-cardinality datasets, has the idea of a
> >>> >> > > frequency-based dictionary encoding been explored? Say, if the
> >>> >> > > data follows a certain statistical distribution, we can create a
> >>> >> > > dictionary of the most frequent values only?
> >>> >> >
> >>> >> > I haven't seen this discussed during my time in the community, but
> >>> >> > maybe it was discussed in the past.  I think the main challenge
> >>> >> > here is that pages are either dictionary encoded or not.  I'd
> >>> >> > guess that to make this practical there would need to be a new
> >>> >> > hybrid page type, which I think might be an interesting idea but
> >>> >> > quite a bit of work.  Additionally, one would likely need
> >>> >> > heuristics for when to potentially use the new mode versus a
> >>> >> > complete fallback.
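
[Editor's note: to make the frequency-based idea concrete, a toy sketch -- purely illustrative; no such hybrid page type exists in parquet-mr today -- of selecting only the most frequent values for a dictionary, leaving the long tail for a plain encoding.]

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Toy frequency-based dictionary selection: keep the k most frequent
// values; everything else would be written with a plain encoding.
public class TopKDictionary {

    static Set<String> topKValues(List<String> values, int k) {
        // Count occurrences of each distinct value.
        Map<String, Long> counts = values.stream()
            .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        // Keep the k most frequent.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "a", "a", "b", "b", "c", "d");
        // "a" and "b" dominate, so they would go in the dictionary; the
        // tail ("c", "d") would fall back to plain encoding.
        Set<String> dict = topKValues(data, 2);
        System.out.println(dict.equals(Set.of("a", "b"))); // true
    }
}
```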
> >>> >> >
> >>> >> > Cheers,
> >>> >> > Micah
> >>> >> >
> >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> >>> >> > wrote:
> >>> >> >
> >>> >> > > Hi dev@,
> >>> >> > >
> >>> >> > > I'm running some benchmarking on Parquet read/write performance
> >>> >> > > and have a few questions about how dictionary encoding works
> >>> >> > > under the hood. Let me know if there's a better channel for
> >>> >> > > this :)
> >>> >> > >
> >>> >> > > My test case uses parquet-avro, where I'm writing a single file
> >>> >> > > containing 5 million records. Each record has a single column,
> >>> >> > > an Avro String field (Parquet binary field). I ran two
> >>> >> > > configurations of the base setup: in the first case, the string
> >>> >> > > field has 5,000 possible unique values. In the second case, it
> >>> >> > > has 50,000 unique values.
> >>> >> > >
> >>> >> > > In the first case (5k unique values), I used parquet-tools to
> >>> >> > > inspect the file metadata and found that a dictionary had been
> >>> >> > > written:
> >>> >> > >
> >>> >> > > % parquet-tools meta testdata-case1.parquet
> >>> >> > > > file schema:  testdata.TestRecord
> >>> >> > > > ----------------------------------------------------------------
> >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> >>> >> > > > ----------------------------------------------------------------
> >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:4 FPO:38918
> >>> >> > > > SZ:8181452/8181452/1.00 VC:5000001
> >>> >> > > > ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999,
> >>> >> > > > num_nulls: 0]
> >>> >> > >
> >>> >> > >
> >>> >> > > But in the second case (50k unique values), parquet-tools shows
> >>> >> > > that no dictionary gets created, and the file size is *much*
> >>> >> > > bigger:
> >>> >> > >
> >>> >> > > % parquet-tools meta testdata-case2.parquet
> >>> >> > > > file schema:  testdata.TestRecord
> >>> >> > > > ----------------------------------------------------------------
> >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> >>> >> > > > ----------------------------------------------------------------
> >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4
> >>> >> > > > SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED
> >>> >> > > > ST:[min: 0, max: 9999, num_nulls: 0]
> >>> >> > >
> >>> >> > >
> >>> >> > > (I created a gist of my test reproduction here
> >>> >> > > <
> >>> https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806
> >>> >> > > >.)
> >>> >> > >
> >>> >> > > Based on this, I'm guessing there's some tip-over point after
> >>> >> > > which Parquet will give up on writing a dictionary for a given
> >>> >> > > column? After reading the Configuration docs
> >>> >> > > <
> >>> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> >>> >> > > >,
> >>> >> > > I tried increasing the dictionary page size configuration 5x,
> >>> >> > > with the same result (no dictionary created).
> >>> >> > >
> >>> >> > > So to summarize, my questions are:
> >>> >> > >
> >>> >> > > - What's the heuristic for Parquet dictionary writing to
> >>> >> > > succeed for a given column?
> >>> >> > > - Is that heuristic configurable at all?
> >>> >> > > - For high-cardinality datasets, has the idea of a
> >>> >> > > frequency-based dictionary encoding been explored? Say, if the
> >>> >> > > data follows a certain statistical distribution, we can create a
> >>> >> > > dictionary of the most frequent values only?
> >>> >> > >
> >>> >> > > Thanks for your time!
> >>> >> > > - Claire
> >>> >> > >
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
>


-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy
