Hey Gang,

Thanks for the followup! I see what you're saying where it's sometimes just
bad luck with what ends up in the first page. The intuition seems like a
larger page size should produce a better encoding in this case... I updated
my branch
<https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
to
add a test with a page size/dict page size of 10MB and am seeing the same
failure, though.

Something seems kind of odd actually -- when I stepped through the test I
added w/ debugger, it falls back after invoking isCompressionSatisfying
with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 1
returns true. (You can also see this in the System.out logs I added, in the
branch's GHA run logs). This doesn't seem right to me -- does
isCompressionSatsifying need an extra check to make sure the
dictionary isn't empty?

Also, thanks, Aaron! I got into this while running some micro-benchmarks on
Parquet reads when various dictionary/bloom filter/encoding options are
configured. Happy to share out when I'm done.

Best,
Claire

On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <ust...@gmail.com> wrote:

> Thanks for the investigation!
>
> I think the check below makes sense for a single page:
>   @Override
>   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>     return (encodedSize + dictionaryByteSize) < rawSize;
>   }
>
> The problem is that the fallback check is only performed on the first page.
> In the first page, all values in that page may be distinct, so it will
> unlikely
> pass the isCompressionSatisfying check.
>
> Best,
> Gang
>
>
> On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett
> <aniskodedoss...@etsy.com.invalid> wrote:
>
> > Claire, thank you for your research and examples on this topic, I've
> > learned a lot.  My hunch is that your change would be a good one, but I'm
> > not an expert (and more to the point, not a committer).  I'm looking
> > forward to learning more as this discussion continues.
> >
> > Thank you again, Aaron
> >
> > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <
> claire.d.mcgi...@gmail.com
> > >
> > wrote:
> >
> > > I created a quick branch
> > > <
> > >
> >
> https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1
> > > >
> > > to reproduce what I'm seeing -- the test shows that an Int column with
> > > cardinality 100 successfully results in a dict encoding, but an int
> > column
> > > with cardinality 10,000 falls back and doesn't create a dict encoding.
> > This
> > > seems like a low threshold given the 1MB dictionary page size, so I
> just
> > > wanted to check whether this is expected or not :)
> > >
> > > Best,
> > > Claire
> > >
> > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <
> > claire.d.mcgi...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi, just wanted to follow up on this!
> > > >
> > > > I ran a debugger to find out why my column wasn't ending up with a
> > > > dictionary encoding and it turns out that even though
> > > > DictionaryValuesWriter#shouldFallback()
> > > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > >
> > > > always returned false (dictionaryByteSize was always less than my
> > > > configured page size), DictionaryValuesWriter#isCompressionSatisfying
> > > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > >
> > > was
> > > > what was causing Parquet to switch
> > > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75
> > > >
> > > > back to the fallback, non-dict writer.
> > > >
> > > > From what I can tell, this check compares the total byte size of
> > > > *every* element with the byte size of each *distinct* element as a
> kind
> > > of
> > > > proxy for encoding efficiency.... however, it seems strange that this
> > > check
> > > > can cause the writer to fall back even if the total encoded dict size
> > is
> > > > far below the configured dictionary page size. Out of curiosity, I
> > > modified
> > > > DictionaryValuesWriter#isCompressionSatisfying
> > > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > >
> > > to
> > > > also check whether total byte size was less than dictionary max size
> > and
> > > > re-ran my Parquet write with a local snapshot, and my file size
> dropped
> > > 50%.
> > > >
> > > > Best,
> > > > Claire
> > > >
> > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <
> > > claire.d.mcgi...@gmail.com>
> > > > wrote:
> > > >
> > > >> Oh, interesting! I'm setting it via the
> > > >> ParquetWriter#withDictionaryPageSize method, and I do see the
> overall
> > > file
> > > >> size increasing when I bump the value. I'll look into it a bit more
> --
> > > it
> > > >> would be helpful for some cases where the # unique values in a
> column
> > is
> > > >> just over the size limit.
> > > >>
> > > >> - Claire
> > > >>
> > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> I'll note there is also a check for encoding effectiveness [1] that
> > > could
> > > >>> come into play but I'd guess that isn't the case here.
> > > >>>
> > > >>> [1]
> > > >>>
> > > >>>
> > >
> >
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
> > > >>>
> > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> > I'm glad I was looking at the right setting for dictionary size.
> I
> > > just
> > > >>> >> tried it out with 10x, 50x, and even total file size, though,
> and
> > > >>> still am
> > > >>> >> not seeing a dictionary get created. Is it possible it's bounded
> > by
> > > >>> file
> > > >>> >> page size or some other layout option that I need to bump as
> well?
> > > >>> >
> > > >>> >
> > > >>> > Sorry I'm less familiar with parquet-mr, hopefully someone else
> to
> > > >>> chime
> > > >>> > in.  If I had to guess, maybe somehow the config value isn't
> making
> > > it
> > > >>> to
> > > >>> > the writer (but there could also be something else at play).
> > > >>> >
> > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <
> > > >>> claire.d.mcgi...@gmail.com>
> > > >>> > wrote:
> > > >>> >
> > > >>> >> Thanks so much, Micah!
> > > >>> >>
> > > >>> >> I think you are using the right setting, but maybe it is
> possible
> > > the
> > > >>> >> > strings are still exceeding the threshold (perhaps increasing
> it
> > > by
> > > >>> 50x
> > > >>> >> or
> > > >>> >> > more to verify)
> > > >>> >>
> > > >>> >>
> > > >>> >> I'm glad I was looking at the right setting for dictionary
> size. I
> > > >>> just
> > > >>> >> tried it out with 10x, 50x, and even total file size, though,
> and
> > > >>> still am
> > > >>> >> not seeing a dictionary get created. Is it possible it's bounded
> > by
> > > >>> file
> > > >>> >> page size or some other layout option that I need to bump as
> well?
> > > >>> >>
> > > >>> >> I haven't seen my discussion during my time in the community but
> > > >>> maybe it
> > > >>> >> > was discussed in the past.  I think the main challenge here is
> > > that
> > > >>> >> pages
> > > >>> >> > are either dictionary encoded or not.  I'd guess to make this
> > > >>> practical
> > > >>> >> > there would need to be a new hybrid page type, which I think
> it
> > > >>> might
> > > >>> >> be an
> > > >>> >> > interesting idea but quite a bit of work.  Additionally, one
> > would
> > > >>> >> likely
> > > >>> >> > need heuristics for when to potentially use the new mode
> versus
> > a
> > > >>> >> complete
> > > >>> >> > fallback.
> > > >>> >> >
> > > >>> >>
> > > >>> >> Got it, thanks for the explanation! It does seem like a huge
> > amount
> > > of
> > > >>> >> work
> > > >>> >>
> > > >>> >>
> > > >>> >> Best,
> > > >>> >> Claire
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <
> > > >>> emkornfi...@gmail.com>
> > > >>> >> wrote:
> > > >>> >>
> > > >>> >> > >
> > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
> > succeed
> > > >>> for a
> > > >>> >> > > given column?
> > > >>> >> >
> > > >>> >> >
> > > >>> >> >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > >>> >> >
> > > >>> >> >
> > > >>> >> > > - Is that heuristic configurable at all?
> > > >>> >> >
> > > >>> >> >
> > > >>> >> > I think you are using the right setting, but maybe it is
> > possible
> > > >>> the
> > > >>> >> > strings are still exceeding the threshold (perhaps increasing
> it
> > > by
> > > >>> 50x
> > > >>> >> or
> > > >>> >> > more to verify)
> > > >>> >> >
> > > >>> >> >
> > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > frequency-based
> > > >>> >> > > dictionary encoding been explored? Say, if the data follows
> a
> > > >>> certain
> > > >>> >> > > statistical distribution, we can create a dictionary of the
> > most
> > > >>> >> frequent
> > > >>> >> > > values only?
> > > >>> >> >
> > > >>> >> > I haven't seen my discussion during my time in the community
> but
> > > >>> maybe
> > > >>> >> it
> > > >>> >> > was discussed in the past.  I think the main challenge here is
> > > that
> > > >>> >> pages
> > > >>> >> > are either dictionary encoded or not.  I'd guess to make this
> > > >>> practical
> > > >>> >> > there would need to be a new hybrid page type, which I think
> it
> > > >>> might
> > > >>> >> be an
> > > >>> >> > interesting idea but quite a bit of work.  Additionally, one
> > would
> > > >>> >> likely
> > > >>> >> > need heuristics for when to potentially use the new mode
> versus
> > a
> > > >>> >> complete
> > > >>> >> > fallback.
> > > >>> >> >
> > > >>> >> > Cheers,
> > > >>> >> > Micah
> > > >>> >> >
> > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
> > > >>> >> > claire.d.mcgi...@gmail.com>
> > > >>> >> > wrote:
> > > >>> >> >
> > > >>> >> > > Hi dev@,
> > > >>> >> > >
> > > >>> >> > > I'm running some benchmarking on Parquet read/write
> > performance
> > > >>> and
> > > >>> >> have
> > > >>> >> > a
> > > >>> >> > > few questions about how dictionary encoding works under the
> > > hood.
> > > >>> Let
> > > >>> >> me
> > > >>> >> > > know if there's a better channel for this :)
> > > >>> >> > >
> > > >>> >> > > My test case uses parquet-avro, where I'm writing a single
> > file
> > > >>> >> > containing
> > > >>> >> > > 5 million records. Each record has a single column, an Avro
> > > String
> > > >>> >> field
> > > >>> >> > > (Parquet binary field). I ran two configurations of base
> > setup:
> > > >>> in the
> > > >>> >> > > first case, the string field has 5,000 possible unique
> values.
> > > In
> > > >>> the
> > > >>> >> > > second case, it has 50,000 unique values.
> > > >>> >> > >
> > > >>> >> > > In the first case (5k unique values), I used parquet-tools
> to
> > > >>> inspect
> > > >>> >> the
> > > >>> >> > > file metadata and found that a dictionary had been written:
> > > >>> >> > >
> > > >>> >> > > % parquet-tools meta testdata-case1.parquet
> > > >>> >> > > > file schema:  testdata.TestRecord
> > > >>> >> > > >
> > > >>> >> > > >
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> --------------------------------------------------------------------------------
> > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > >>> >> > > >
> > > >>> >> > > >
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> --------------------------------------------------------------------------------
> > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918
> > > >>> >> > SZ:8181452/8181452/1.00
> > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0,
> max:
> > > 999,
> > > >>> >> > > num_nulls:
> > > >>> >> > > > 0]
> > > >>> >> > >
> > > >>> >> > >
> > > >>> >> > > But in the second case (50k unique values), parquet-tools
> > shows
> > > >>> that
> > > >>> >> no
> > > >>> >> > > dictionary gets created, and the file size is *much* bigger:
> > > >>> >> > >
> > > >>> >> > > % parquet-tools meta testdata-case2.parquet
> > > >>> >> > > > file schema:  testdata.TestRecord
> > > >>> >> > > >
> > > >>> >> > > >
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> --------------------------------------------------------------------------------
> > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > >>> >> > > >
> > > >>> >> > > >
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> --------------------------------------------------------------------------------
> > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4
> > > >>> >> SZ:43896278/43896278/1.00
> > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999,
> > > >>> num_nulls: 0]
> > > >>> >> > >
> > > >>> >> > >
> > > >>> >> > > (I created a gist of my test reproduction here
> > > >>> >> > > <
> > > >>> >>
> > > >>>
> > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806
> > > >>> >> > >.)
> > > >>> >> > >
> > > >>> >> > > Based on this, I'm guessing there's some tip-over point
> after
> > > >>> which
> > > >>> >> > Parquet
> > > >>> >> > > will give up on writing a dictionary for a given column?
> After
> > > >>> reading
> > > >>> >> > > the Configuration
> > > >>> >> > > docs
> > > >>> >> > > <
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > > >>> >> > > >,
> > > >>> >> > > I tried increasing the dictionary page size configuration
> 5x,
> > > >>> with the
> > > >>> >> > same
> > > >>> >> > > result (no dictionary created).
> > > >>> >> > >
> > > >>> >> > > So to summarize, my questions are:
> > > >>> >> > >
> > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
> > succeed
> > > >>> for a
> > > >>> >> > > given column?
> > > >>> >> > > - Is that heuristic configurable at all?
> > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > frequency-based
> > > >>> >> > > dictionary encoding been explored? Say, if the data follows
> a
> > > >>> certain
> > > >>> >> > > statistical distribution, we can create a dictionary of the
> > most
> > > >>> >> frequent
> > > >>> >> > > values only?
> > > >>> >> > >
> > > >>> >> > > Thanks for your time!
> > > >>> >> > > - Claire
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>> >
> > > >>>
> > > >>
> > >
> >
> >
> > --
> > Aaron Niskode-Dossett, Data Engineering -- Etsy
> >
>

Reply via email to