I agree, thanks for looking into this.

> Overall, I think the equalizing behavior of file-level compression (ZSTD)
> makes it not worth it to add a configuration option for dictionary
> compression :)

One reason to potentially still move forward with configuration here is
that general-purpose compression can be significantly slower than
dictionary compression.  If you've already chosen to compress, then I agree
a new knob might not be worth it.  At least some recent research [1] points
to general-purpose compression being a bottleneck.

[1] https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf
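
For anyone skimming the thread, here is a minimal sketch of the first-page
check discussed below, plus a hypothetical ratio-based variant in the spirit
of the option that was benchmarked. Only `isCompressionSatisfying` and its
comparison come from the parquet-mr snippet quoted further down; the class
name, the extra `dictionaryByteSize` parameter (a field in the real writer),
`meetsDistinctRatio`, and all byte counts in `main` are assumptions made up
for illustration.

```java
// Illustrative sketch only; this is not the parquet-mr implementation.
final class DictionaryFallbackSketch {

  // Same comparison as the snippet quoted in this thread. In parquet-mr,
  // dictionaryByteSize is a field of the writer; it is passed in here so
  // the sketch stays self-contained.
  static boolean isCompressionSatisfying(long rawSize, long encodedSize, long dictionaryByteSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }

  // Hypothetical variant: accept the dictionary when the share of distinct
  // values stays at or below maxDistinctRatio (a value between 0 and 1.0).
  static boolean meetsDistinctRatio(long distinctValues, long totalValues, double maxDistinctRatio) {
    if (totalValues <= 0) {
      return false; // nothing written yet, so nothing to judge
    }
    return (double) distinctValues / totalValues <= maxDistinctRatio;
  }

  public static void main(String[] args) {
    // Assumed first page: 1,000 four-byte ints, all distinct.
    // raw = 4,000 bytes, dictionary = 4,000 bytes, ids = 1,250 bytes.
    System.out.println(isCompressionSatisfying(4000, 1250, 4000)); // false: writer falls back

    // Assumed first page: 1,000 four-byte ints drawn from 100 distinct values.
    // raw = 4,000 bytes, dictionary = 400 bytes, ids = 875 bytes.
    System.out.println(isCompressionSatisfying(4000, 875, 400)); // true: dictionary is kept

    // Ratio variant on the all-distinct page with an assumed 0.5 threshold.
    System.out.println(meetsDistinctRatio(1000, 1000, 0.5)); // false
  }
}
```

The first call shows why an unlucky first page matters: when every value in
the page is distinct, the dictionary alone is as large as the raw data, so
the check cannot pass regardless of how repetitive later pages are.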

On Wed, Sep 27, 2023 at 9:58 PM Gang Wu <ust...@gmail.com> wrote:

> Thanks for the thorough benchmark. These findings are pretty interesting!
>
> On Thu, Sep 28, 2023 at 5:32 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > Just to follow up, I ran some benchmarks with an added Configuration
> > option to set a desired "compression ratio" param, which you can see here
> > <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>,
> > on a variety of data layouts (distribution, sorting, cardinality etc). I
> > also have a table of comparisons
> > <https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison>
> > using the latest 0.14.0-SNAPSHOT as a baseline. These are my takeaways:
> >
> >    - The compression ratio param doesn't benefit data that's in sorted
> >    order (IMO, because even without the addition of this param, sorted
> >    columns are more likely to produce efficient dict encodings).
> >    - On shuffled (non-sorted) data, setting the compression ratio param
> >    produces a much better *uncompressed* file result (in one case, 39MB
> >    to the baseline 102MB). However, after applying a file-level compression
> >    algorithm such as ZSTD, the baseline and ratio-param results turn out
> >    pretty much equal (within a 5% margin). I think this makes sense, since
> >    in Parquet 1.0 dictionaries are not encoded, so ZSTD must be better at
> >    compressing many repeated column values across a file than it is at
> >    compressing dictionaries across all pages.
> >    - The compression ratio param works best with larger page sizes (10MB
> >    or 50MB) with large-ish dictionary page sizes (10MB).
> >
> > Overall, I think the equalizing behavior of file-level compression (ZSTD)
> > makes it not worth it to add a configuration option for dictionary
> > compression :) Thanks for all of your input on this -- if nothing else,
> > the benchmarks are a really interesting look at how important data layout
> > is to overall file size!
> >
> > Best,
> > Claire
> >
> > On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> > wrote:
> >
> > > Hmm, like a flag to basically turn off the isCompressionSatisfying
> > > check per-column? That might be simplest!
> > >
> > > So to summarize, a column will not write a dictionary encoding when
> > > any of the following holds:
> > >
> > > (1) `parquet.enable.dictionary` is set to False
> > > (2) # of distinct values in a column chunk exceeds
> > > `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
> > > Integer.MAX_VALUE)
> > > (3) Total encoded bytes in a dictionary exceed the value of
> > > `parquet.dictionary.page.size`
> > > (4) Desired compression ratio (as a measure of # distinct values :
> > > total # values) is not achieved
> > >
> > > I might try out various options for making (4) configurable, starting
> > > with your suggestion, and testing them out on more realistic data
> > > distributions. Will try to return to this thread with my results in a
> > > few days :)
> > >
> > > Best,
> > > Claire
> > >
> > >
> > > On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <ust...@gmail.com> wrote:
> > >
> > >> The current implementation only checks the first page, which is
> > >> vulnerable in many cases. I think your suggestion makes sense.
> > >> However, there is no one-size-fits-all solution. How about simply
> > >> adding a flag to enforce dictionary encoding for a specific column?
> > >>
> > >>
> > >> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <
> > >> claire.d.mcgi...@gmail.com>
> > >> wrote:
> > >>
> > >> > I think I figured it out! The dictionaryByteSize == 0 was a red
> > >> > herring; I was looking at an IntegerDictionaryValuesWriter for an
> > >> > empty column rather than my high-cardinality column. Your analysis of
> > >> > the situation was right--it was just that in the first page, there
> > >> > weren't enough distinct values to pass the check.
> > >> >
> > >> > I wonder if we could maybe make this value configurable per-column?
> > >> > Either:
> > >> >
> > >> > - A desired ratio of distinct values / total values, on a scale of
> > >> > 0-1.0
> > >> > - Number of pages to check for compression before falling back
> > >> >
> > >> > Let me know what you think!
> > >> >
> > >> > Best,
> > >> > Claire
> > >> >
> > >> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <ust...@gmail.com> wrote:
> > >> >
> > >> > > I don't understand why you get encodedSize == 1, dictionaryByteSize
> > >> > > == 0 and rawSize == 0 in the first page check. It seems that the
> > >> > > page does not have any meaningful values. Could you please check how
> > >> > > many values are written before the page check?
> > >> > >
> > >> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <
> > >> > > claire.d.mcgi...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hey Gang,
> > >> > > >
> > >> > > > Thanks for the followup! I see what you're saying where it's
> > >> > > > sometimes just bad luck with what ends up in the first page. The
> > >> > > > intuition seems like a larger page size should produce a better
> > >> > > > encoding in this case... I updated my branch
> > >> > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > >> > > > to add a test with a page size/dict page size of 10MB and am seeing
> > >> > > > the same failure, though.
> > >> > > >
> > >> > > > Something seems kind of odd actually -- when I stepped through the
> > >> > > > test I added w/ debugger, it falls back after invoking
> > >> > > > isCompressionSatisfying with encodedSize == 1, dictionaryByteSize
> > >> > > > == 0 and rawSize == 0; 1 + 0 < 1 returns true. (You can also see
> > >> > > > this in the System.out logs I added, in the branch's GHA run
> > >> > > > logs). This doesn't seem right to me -- does
> > >> > > > isCompressionSatisfying need an extra check to make sure the
> > >> > > > dictionary isn't empty?
> > >> > > >
> > >> > > > Also, thanks, Aaron! I got into this while running some
> > >> > > > micro-benchmarks on Parquet reads when various dictionary/bloom
> > >> > > > filter/encoding options are configured. Happy to share out when
> > >> > > > I'm done.
> > >> > > >
> > >> > > > Best,
> > >> > > > Claire
> > >> > > >
> > >> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <ust...@gmail.com> wrote:
> > >> > > >
> > >> > > > > Thanks for the investigation!
> > >> > > > >
> > >> > > > > I think the check below makes sense for a single page:
> > >> > > > >   @Override
> > >> > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
> > >> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > >> > > > >   }
> > >> > > > >
> > >> > > > > The problem is that the fallback check is only performed on the
> > >> > > > > first page. In the first page, all values in that page may be
> > >> > > > > distinct, so it is unlikely to pass the isCompressionSatisfying
> > >> > > > > check.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Gang
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett
> > >> > > > > <aniskodedoss...@etsy.com.invalid> wrote:
> > >> > > > >
> > >> > > > > > Claire, thank you for your research and examples on this topic,
> > >> > > > > > I've learned a lot.  My hunch is that your change would be a
> > >> > > > > > good one, but I'm not an expert (and more to the point, not a
> > >> > > > > > committer).  I'm looking forward to learning more as this
> > >> > > > > > discussion continues.
> > >> > > > > >
> > >> > > > > > Thank you again, Aaron
> > >> > > > > >
> > >> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I created a quick branch
> > >> > > > > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> > >> > > > > > > to reproduce what I'm seeing -- the test shows that an Int
> > >> > > > > > > column with cardinality 100 successfully results in a dict
> > >> > > > > > > encoding, but an int column with cardinality 10,000 falls back
> > >> > > > > > > and doesn't create a dict encoding. This seems like a low
> > >> > > > > > > threshold given the 1MB dictionary page size, so I just wanted
> > >> > > > > > > to check whether this is expected or not :)
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Claire
> > >> > > > > > >
> > >> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi, just wanted to follow up on this!
> > >> > > > > > > >
> > >> > > > > > > > I ran a debugger to find out why my column wasn't ending up
> > >> > > > > > > > with a dictionary encoding and it turns out that even though
> > >> > > > > > > > DictionaryValuesWriter#shouldFallback()
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
> > >> > > > > > > > always returned false (dictionaryByteSize was always less
> > >> > > > > > > > than my configured page size),
> > >> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> > >> > > > > > > > was what was causing Parquet to switch
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
> > >> > > > > > > > back to the fallback, non-dict writer.
> > >> > > > > > > >
> > >> > > > > > > > From what I can tell, this check compares the total byte
> > >> > > > > > > > size of *every* element with the byte size of each
> > >> > > > > > > > *distinct* element as a kind of proxy for encoding
> > >> > > > > > > > efficiency.... however, it seems strange that this check
> > >> > > > > > > > can cause the writer to fall back even if the total encoded
> > >> > > > > > > > dict size is far below the configured dictionary page size.
> > >> > > > > > > > Out of curiosity, I modified
> > >> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
> > >> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> > >> > > > > > > > to also check whether total byte size was less than
> > >> > > > > > > > dictionary max size and re-ran my Parquet write with a
> > >> > > > > > > > local snapshot, and my file size dropped 50%.
> > >> > > > > > > >
> > >> > > > > > > > Best,
> > >> > > > > > > > Claire
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > >> Oh, interesting! I'm setting it via the
> > >> > > > > > > >> ParquetWriter#withDictionaryPageSize method, and I do see
> > >> > > > > > > >> the overall file size increasing when I bump the value. I'll
> > >> > > > > > > >> look into it a bit more -- it would be helpful for some
> > >> > > > > > > >> cases where the # unique values in a column is just over
> > >> > > > > > > >> the size limit.
> > >> > > > > > > >>
> > >> > > > > > > >> - Claire
> > >> > > > > > > >>
> > >> > > > > > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
> > >> > > > > > > >> wrote:
> > >> > > > > > > >>
> > >> > > > > > > >>> I'll note there is also a check for encoding effectiveness
> > >> > > > > > > >>> [1] that could come into play but I'd guess that isn't the
> > >> > > > > > > >>> case here.
> > >> > > > > > > >>>
> > >> > > > > > > >>> [1]
> > >> > > > > > > >>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
> > >> > > > > > > >>>
> > >> > > > > > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
> > >> > > > > > > >>> wrote:
> > >> > > > > > > >>>
> > >> > > > > > > >>> >> I'm glad I was looking at the right setting for dictionary
> > >> > > > > > > >>> >> size. I just tried it out with 10x, 50x, and even total file
> > >> > > > > > > >>> >> size, though, and still am not seeing a dictionary get
> > >> > > > > > > >>> >> created. Is it possible it's bounded by file page size or
> > >> > > > > > > >>> >> some other layout option that I need to bump as well?
> > >> > > > > > > >>> >
> > >> > > > > > > >>> >
> > >> > > > > > > >>> > Sorry, I'm less familiar with parquet-mr; hopefully someone
> > >> > > > > > > >>> > else can chime in.  If I had to guess, maybe somehow the
> > >> > > > > > > >>> > config value isn't making it to the writer (but there could
> > >> > > > > > > >>> > also be something else at play).
> > >> > > > > > > >>> >
> > >> > > > > > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > >> > > > > > > >>> > wrote:
> > >> > > > > > > >>> >
> > >> > > > > > > >>> >> Thanks so much, Micah!
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > I think you are using the right setting, but maybe it is
> > >> > > > > > > >>> >> > possible the strings are still exceeding the threshold
> > >> > > > > > > >>> >> > (perhaps increasing it by 50x or more to verify)
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> I'm glad I was looking at the right setting for dictionary
> > >> > > > > > > >>> >> size. I just tried it out with 10x, 50x, and even total file
> > >> > > > > > > >>> >> size, though, and still am not seeing a dictionary get
> > >> > > > > > > >>> >> created. Is it possible it's bounded by file page size or
> > >> > > > > > > >>> >> some other layout option that I need to bump as well?
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > I haven't seen any discussion during my time in the
> > >> > > > > > > >>> >> > community but maybe it was discussed in the past.  I think
> > >> > > > > > > >>> >> > the main challenge here is that pages are either
> > >> > > > > > > >>> >> > dictionary encoded or not.  I'd guess to make this
> > >> > > > > > > >>> >> > practical there would need to be a new hybrid page type,
> > >> > > > > > > >>> >> > which I think it might be an interesting idea but quite a
> > >> > > > > > > >>> >> > bit of work.  Additionally, one would likely need
> > >> > > > > > > >>> >> > heuristics for when to potentially use the new mode versus
> > >> > > > > > > >>> >> > a complete fallback.
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> Got it, thanks for the explanation! It does seem like a huge
> > >> > > > > > > >>> >> amount of work
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> Best,
> > >> > > > > > > >>> >> Claire
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
> > >> > > > > > > >>> >> wrote:
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
> > >> > > > > > > >>> >> > > succeed for a given column?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > > - Is that heuristic configurable at all?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > I think you are using the right setting, but maybe it is
> > >> > > > > > > >>> >> > possible the strings are still exceeding the threshold
> > >> > > > > > > >>> >> > (perhaps increasing it by 50x or more to verify)
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > >> > > > > > > >>> >> > > frequency-based dictionary encoding been explored? Say,
> > >> > > > > > > >>> >> > > if the data follows a certain statistical distribution,
> > >> > > > > > > >>> >> > > we can create a dictionary of the most frequent values
> > >> > > > > > > >>> >> > > only?
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > I haven't seen any discussion during my time in the
> > >> > > > > > > >>> >> > community but maybe it was discussed in the past.  I think
> > >> > > > > > > >>> >> > the main challenge here is that pages are either
> > >> > > > > > > >>> >> > dictionary encoded or not.  I'd guess to make this
> > >> > > > > > > >>> >> > practical there would need to be a new hybrid page type,
> > >> > > > > > > >>> >> > which I think it might be an interesting idea but quite a
> > >> > > > > > > >>> >> > bit of work.  Additionally, one would likely need
> > >> > > > > > > >>> >> > heuristics for when to potentially use the new mode versus
> > >> > > > > > > >>> >> > a complete fallback.
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > Cheers,
> > >> > > > > > > >>> >> > Micah
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> > >> > > > > > > >>> >> > wrote:
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >> > > Hi dev@,
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > I'm running some benchmarking on Parquet read/write
> > >> > > > > > > >>> >> > > performance and have a few questions about how dictionary
> > >> > > > > > > >>> >> > > encoding works under the hood. Let me know if there's a
> > >> > > > > > > >>> >> > > better channel for this :)
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > My test case uses parquet-avro, where I'm writing a single
> > >> > > > > > > >>> >> > > file containing 5 million records. Each record has a single
> > >> > > > > > > >>> >> > > column, an Avro String field (Parquet binary field). I ran
> > >> > > > > > > >>> >> > > two configurations of base setup: in the first case, the
> > >> > > > > > > >>> >> > > string field has 5,000 possible unique values. In the
> > >> > > > > > > >>> >> > > second case, it has 50,000 unique values.
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > In the first case (5k unique values), I used parquet-tools
> > >> > > > > > > >>> >> > > to inspect the file metadata and found that a dictionary
> > >> > > > > > > >>> >> > > had been written:
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > % parquet-tools meta testdata-case1.parquet
> > >> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > >> > > > > > > >>> >> > > >
> > >> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
> > >> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > >> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > >> > > > > > > >>> >> > > >
> > >> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
> > >> > > > > > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> > >> > > > > > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > But in the second case (50k unique values), parquet-tools
> > >> > > > > > > >>> >> > > shows that no dictionary gets created, and the file size is
> > >> > > > > > > >>> >> > > *much* bigger:
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > % parquet-tools meta testdata-case2.parquet
> > >> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > >> > > > > > > >>> >> > > >
> > >> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
> > >> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > >> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > >> > > > > > > >>> >> > > >
> > >> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
> > >> > > > > > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> > >> > > > > > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > (I created a gist of my test reproduction here
> > >> > > > > > > >>> >> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > Based on this, I'm guessing there's some tip-over point
> > >> > > > > > > >>> >> > > after which Parquet will give up on writing a dictionary
> > >> > > > > > > >>> >> > > for a given column? After reading the Configuration docs
> > >> > > > > > > >>> >> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
> > >> > > > > > > >>> >> > > I tried increasing the dictionary page size configuration
> > >> > > > > > > >>> >> > > 5x, with the same result (no dictionary created).
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > So to summarize, my questions are:
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
> > >> > > > > > > >>> >> > > succeed for a given column?
> > >> > > > > > > >>> >> > > - Is that heuristic configurable at all?
> > >> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > >> > > > > > > >>> >> > > frequency-based dictionary encoding been explored? Say,
> > >> > > > > > > >>> >> > > if the data follows a certain statistical distribution,
> > >> > > > > > > >>> >> > > we can create a dictionary of the most frequent values
> > >> > > > > > > >>> >> > > only?
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> > > Thanks for your time!
> > >> > > > > > > >>> >> > > - Claire
> > >> > > > > > > >>> >> > >
> > >> > > > > > > >>> >> >
> > >> > > > > > > >>> >>
> > >> > > > > > > >>> >
> > >> > > > > > > >>>
> > >> > > > > > > >>
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
