Re: Parquet dictionary size limits?

Claire McGinty Thu, 21 Sep 2023 11:13:05 -0700

Hmm, like a flag to basically turn off the isCompressionSatisfying check
per-column? That might be simplest!


So to summarize, a column will not write a dictionary encoding when either:

(1) `parquet.enable.dictionary` is set to False
(2) # of distinct values in a column chunk exceeds
`DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
Integer.MAX_VALUE)
(3) Total encoded bytes in a dictionary exceed the value of
`parquet.dictionary.page.size`
(4) Desired compression ratio (as a measure of # distinct values : total #
values) is not achieved

I might try out various options for making (4) configurable, starting with
your suggestion, and testing them out on more realistic data distributions.
Will try to return to this thread with my results in a few days :)

Best,
Claire


On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <ust...@gmail.com> wrote:

> The current implementation only checks the first page, which is
> vulnerable in many cases. I think your suggestion makes sense.
> However, there is no one-fit-for-all solution. How about simply
> adding a flag to enforce dictionary encoding to a specific column?
>
>
> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <claire.d.mcgi...@gmail.com
> >
> wrote:
>
> > I think I figured it out! The dictionaryByteSize == 0 was a red herring;
> I
> > was looking at an IntegerDictionaryValuesWriter for an empty column
> rather
> > than my high-cardinality column. Your analysis of the situation was
> > right--it was just that in the first page, there weren't enough distinct
> > values to pass the check.
> >
> > I wonder if we could maybe make this value configurable per-column?
> Either:
> >
> > - A desired ratio of distinct values / total values, on a scale of 0-1.0
> > - Number of pages to check for compression before falling back
> >
> > Let me know what you think!
> >
> > Best,
> > Claire
> >
> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <ust...@gmail.com> wrote:
> >
> > > I don't understand why you get encodedSize == 1, dictionaryByteSize ==
> 0
> > > and rawSize == 0 in the first page check. It seems that the page does
> not
> > > have any meaning values. Could you please check how many values are
> > > written before the page check?
> > >
> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <
> > > claire.d.mcgi...@gmail.com>
> > > wrote:
> > >
> > > > Hey Gang,
> > > >
> > > > Thanks for the followup! I see what you're saying where it's
> sometimes
> > > just
> > > > bad luck with what ends up in the first page. The intuition seems
> like
> > a
> > > > larger page size should produce a better encoding in this case... I
> > > updated
> > > > my branch
> > > > <
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1
> > > > >
> > > > to
> > > > add a test with a page size/dict page size of 10MB and am seeing the
> > same
> > > > failure, though.
> > > >
> > > > Something seems kind of odd actually -- when I stepped through the
> > test I
> > > > added w/ debugger, it falls back after invoking
> isCompressionSatisfying
> > > > with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 +
> 0
> > <
> > > 1
> > > > returns true. (You can also see this in the System.out logs I added,
> in
> > > the
> > > > branch's GHA run logs). This doesn't seem right to me -- does
> > > > isCompressionSatsifying need an extra check to make sure the
> > > > dictionary isn't empty?
> > > >
> > > > Also, thanks, Aaron! I got into this while running some
> > micro-benchmarks
> > > on
> > > > Parquet reads when various dictionary/bloom filter/encoding options
> are
> > > > configured. Happy to share out when I'm done.
> > > >
> > > > Best,
> > > > Claire
> > > >
> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <ust...@gmail.com> wrote:
> > > >
> > > > > Thanks for the investigation!
> > > > >
> > > > > I think the check below makes sense for a single page:
> > > > >   @Override
> > > > >   public boolean isCompressionSatisfying(long rawSize, long
> > > encodedSize)
> > > > {
> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
> > > > >   }
> > > > >
> > > > > The problem is that the fallback check is only performed on the
> first
> > > > page.
> > > > > In the first page, all values in that page may be distinct, so it
> > will
> > > > > unlikely
> > > > > pass the isCompressionSatisfying check.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > >
> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett
> > > > > <aniskodedoss...@etsy.com.invalid> wrote:
> > > > >
> > > > > > Claire, thank you for your research and examples on this topic,
> > I've
> > > > > > learned a lot.  My hunch is that your change would be a good one,
> > but
> > > > I'm
> > > > > > not an expert (and more to the point, not a committer).  I'm
> > looking
> > > > > > forward to learning more as this discussion continues.
> > > > > >
> > > > > > Thank you again, Aaron
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <
> > > > > claire.d.mcgi...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I created a quick branch
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1
> > > > > > > >
> > > > > > > to reproduce what I'm seeing -- the test shows that an Int
> column
> > > > with
> > > > > > > cardinality 100 successfully results in a dict encoding, but an
> > int
> > > > > > column
> > > > > > > with cardinality 10,000 falls back and doesn't create a dict
> > > > encoding.
> > > > > > This
> > > > > > > seems like a low threshold given the 1MB dictionary page size,
> > so I
> > > > > just
> > > > > > > wanted to check whether this is expected or not :)
> > > > > > >
> > > > > > > Best,
> > > > > > > Claire
> > > > > > >
> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <
> > > > > > claire.d.mcgi...@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, just wanted to follow up on this!
> > > > > > > >
> > > > > > > > I ran a debugger to find out why my column wasn't ending up
> > with
> > > a
> > > > > > > > dictionary encoding and it turns out that even though
> > > > > > > > DictionaryValuesWriter#shouldFallback()
> > > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > > > > > >
> > > > > > > > always returned false (dictionaryByteSize was always less
> than
> > my
> > > > > > > > configured page size),
> > > > DictionaryValuesWriter#isCompressionSatisfying
> > > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > > > > > >
> > > > > > > was
> > > > > > > > what was causing Parquet to switch
> > > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75
> > > > > > > >
> > > > > > > > back to the fallback, non-dict writer.
> > > > > > > >
> > > > > > > > From what I can tell, this check compares the total byte size
> > of
> > > > > > > > *every* element with the byte size of each *distinct* element
> > as
> > > a
> > > > > kind
> > > > > > > of
> > > > > > > > proxy for encoding efficiency.... however, it seems strange
> > that
> > > > this
> > > > > > > check
> > > > > > > > can cause the writer to fall back even if the total encoded
> > dict
> > > > size
> > > > > > is
> > > > > > > > far below the configured dictionary page size. Out of
> > curiosity,
> > > I
> > > > > > > modified
> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
> > > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125
> > > > > > >
> > > > > > > to
> > > > > > > > also check whether total byte size was less than dictionary
> max
> > > > size
> > > > > > and
> > > > > > > > re-ran my Parquet write with a local snapshot, and my file
> size
> > > > > dropped
> > > > > > > 50%.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Claire
> > > > > > > >
> > > > > > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <
> > > > > > > claire.d.mcgi...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Oh, interesting! I'm setting it via the
> > > > > > > >> ParquetWriter#withDictionaryPageSize method, and I do see
> the
> > > > > overall
> > > > > > > file
> > > > > > > >> size increasing when I bump the value. I'll look into it a
> bit
> > > > more
> > > > > --
> > > > > > > it
> > > > > > > >> would be helpful for some cases where the # unique values
> in a
> > > > > column
> > > > > > is
> > > > > > > >> just over the size limit.
> > > > > > > >>
> > > > > > > >> - Claire
> > > > > > > >>
> > > > > > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <
> > > > > > emkornfi...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >>> I'll note there is also a check for encoding effectiveness
> > [1]
> > > > that
> > > > > > > could
> > > > > > > >>> come into play but I'd guess that isn't the case here.
> > > > > > > >>>
> > > > > > > >>> [1]
> > > > > > > >>>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
> > > > > > > >>>
> > > > > > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <
> > > > > > emkornfi...@gmail.com
> > > > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>>
> > > > > > > >>> > I'm glad I was looking at the right setting for
> dictionary
> > > > size.
> > > > > I
> > > > > > > just
> > > > > > > >>> >> tried it out with 10x, 50x, and even total file size,
> > > though,
> > > > > and
> > > > > > > >>> still am
> > > > > > > >>> >> not seeing a dictionary get created. Is it possible it's
> > > > bounded
> > > > > > by
> > > > > > > >>> file
> > > > > > > >>> >> page size or some other layout option that I need to
> bump
> > as
> > > > > well?
> > > > > > > >>> >
> > > > > > > >>> >
> > > > > > > >>> > Sorry I'm less familiar with parquet-mr, hopefully
> someone
> > > else
> > > > > to
> > > > > > > >>> chime
> > > > > > > >>> > in.  If I had to guess, maybe somehow the config value
> > isn't
> > > > > making
> > > > > > > it
> > > > > > > >>> to
> > > > > > > >>> > the writer (but there could also be something else at
> > play).
> > > > > > > >>> >
> > > > > > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <
> > > > > > > >>> claire.d.mcgi...@gmail.com>
> > > > > > > >>> > wrote:
> > > > > > > >>> >
> > > > > > > >>> >> Thanks so much, Micah!
> > > > > > > >>> >>
> > > > > > > >>> >> I think you are using the right setting, but maybe it is
> > > > > possible
> > > > > > > the
> > > > > > > >>> >> > strings are still exceeding the threshold (perhaps
> > > > increasing
> > > > > it
> > > > > > > by
> > > > > > > >>> 50x
> > > > > > > >>> >> or
> > > > > > > >>> >> > more to verify)
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> I'm glad I was looking at the right setting for
> dictionary
> > > > > size. I
> > > > > > > >>> just
> > > > > > > >>> >> tried it out with 10x, 50x, and even total file size,
> > > though,
> > > > > and
> > > > > > > >>> still am
> > > > > > > >>> >> not seeing a dictionary get created. Is it possible it's
> > > > bounded
> > > > > > by
> > > > > > > >>> file
> > > > > > > >>> >> page size or some other layout option that I need to
> bump
> > as
> > > > > well?
> > > > > > > >>> >>
> > > > > > > >>> >> I haven't seen my discussion during my time in the
> > community
> > > > but
> > > > > > > >>> maybe it
> > > > > > > >>> >> > was discussed in the past.  I think the main challenge
> > > here
> > > > is
> > > > > > > that
> > > > > > > >>> >> pages
> > > > > > > >>> >> > are either dictionary encoded or not.  I'd guess to
> make
> > > > this
> > > > > > > >>> practical
> > > > > > > >>> >> > there would need to be a new hybrid page type, which I
> > > think
> > > > > it
> > > > > > > >>> might
> > > > > > > >>> >> be an
> > > > > > > >>> >> > interesting idea but quite a bit of work.
> Additionally,
> > > one
> > > > > > would
> > > > > > > >>> >> likely
> > > > > > > >>> >> > need heuristics for when to potentially use the new
> mode
> > > > > versus
> > > > > > a
> > > > > > > >>> >> complete
> > > > > > > >>> >> > fallback.
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>> >> Got it, thanks for the explanation! It does seem like a
> > huge
> > > > > > amount
> > > > > > > of
> > > > > > > >>> >> work
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> Best,
> > > > > > > >>> >> Claire
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <
> > > > > > > >>> emkornfi...@gmail.com>
> > > > > > > >>> >> wrote:
> > > > > > > >>> >>
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary
> writing
> > to
> > > > > > succeed
> > > > > > > >>> for a
> > > > > > > >>> >> > > given column?
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> > > - Is that heuristic configurable at all?
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> > I think you are using the right setting, but maybe it
> is
> > > > > > possible
> > > > > > > >>> the
> > > > > > > >>> >> > strings are still exceeding the threshold (perhaps
> > > > increasing
> > > > > it
> > > > > > > by
> > > > > > > >>> 50x
> > > > > > > >>> >> or
> > > > > > > >>> >> > more to verify)
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > > > > > frequency-based
> > > > > > > >>> >> > > dictionary encoding been explored? Say, if the data
> > > > follows
> > > > > a
> > > > > > > >>> certain
> > > > > > > >>> >> > > statistical distribution, we can create a dictionary
> > of
> > > > the
> > > > > > most
> > > > > > > >>> >> frequent
> > > > > > > >>> >> > > values only?
> > > > > > > >>> >> >
> > > > > > > >>> >> > I haven't seen my discussion during my time in the
> > > community
> > > > > but
> > > > > > > >>> maybe
> > > > > > > >>> >> it
> > > > > > > >>> >> > was discussed in the past.  I think the main challenge
> > > here
> > > > is
> > > > > > > that
> > > > > > > >>> >> pages
> > > > > > > >>> >> > are either dictionary encoded or not.  I'd guess to
> make
> > > > this
> > > > > > > >>> practical
> > > > > > > >>> >> > there would need to be a new hybrid page type, which I
> > > think
> > > > > it
> > > > > > > >>> might
> > > > > > > >>> >> be an
> > > > > > > >>> >> > interesting idea but quite a bit of work.
> Additionally,
> > > one
> > > > > > would
> > > > > > > >>> >> likely
> > > > > > > >>> >> > need heuristics for when to potentially use the new
> mode
> > > > > versus
> > > > > > a
> > > > > > > >>> >> complete
> > > > > > > >>> >> > fallback.
> > > > > > > >>> >> >
> > > > > > > >>> >> > Cheers,
> > > > > > > >>> >> > Micah
> > > > > > > >>> >> >
> > > > > > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
> > > > > > > >>> >> > claire.d.mcgi...@gmail.com>
> > > > > > > >>> >> > wrote:
> > > > > > > >>> >> >
> > > > > > > >>> >> > > Hi dev@,
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > I'm running some benchmarking on Parquet read/write
> > > > > > performance
> > > > > > > >>> and
> > > > > > > >>> >> have
> > > > > > > >>> >> > a
> > > > > > > >>> >> > > few questions about how dictionary encoding works
> > under
> > > > the
> > > > > > > hood.
> > > > > > > >>> Let
> > > > > > > >>> >> me
> > > > > > > >>> >> > > know if there's a better channel for this :)
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > My test case uses parquet-avro, where I'm writing a
> > > single
> > > > > > file
> > > > > > > >>> >> > containing
> > > > > > > >>> >> > > 5 million records. Each record has a single column,
> an
> > > > Avro
> > > > > > > String
> > > > > > > >>> >> field
> > > > > > > >>> >> > > (Parquet binary field). I ran two configurations of
> > base
> > > > > > setup:
> > > > > > > >>> in the
> > > > > > > >>> >> > > first case, the string field has 5,000 possible
> unique
> > > > > values.
> > > > > > > In
> > > > > > > >>> the
> > > > > > > >>> >> > > second case, it has 50,000 unique values.
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > In the first case (5k unique values), I used
> > > parquet-tools
> > > > > to
> > > > > > > >>> inspect
> > > > > > > >>> >> the
> > > > > > > >>> >> > > file metadata and found that a dictionary had been
> > > > written:
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > % parquet-tools meta testdata-case1.parquet
> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918
> > > > > > > >>> >> > SZ:8181452/8181452/1.00
> > > > > > > >>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY
> ST:[min:
> > 0,
> > > > > max:
> > > > > > > 999,
> > > > > > > >>> >> > > num_nulls:
> > > > > > > >>> >> > > > 0]
> > > > > > > >>> >> > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > But in the second case (50k unique values),
> > > parquet-tools
> > > > > > shows
> > > > > > > >>> that
> > > > > > > >>> >> no
> > > > > > > >>> >> > > dictionary gets created, and the file size is *much*
> > > > bigger:
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > % parquet-tools meta testdata-case2.parquet
> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --------------------------------------------------------------------------------
> > > > > > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4
> > > > > > > >>> >> SZ:43896278/43896278/1.00
> > > > > > > >>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max:
> > 9999,
> > > > > > > >>> num_nulls: 0]
> > > > > > > >>> >> > >
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > (I created a gist of my test reproduction here
> > > > > > > >>> >> > > <
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > >
> > > https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806
> > > > > > > >>> >> > >.)
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > Based on this, I'm guessing there's some tip-over
> > point
> > > > > after
> > > > > > > >>> which
> > > > > > > >>> >> > Parquet
> > > > > > > >>> >> > > will give up on writing a dictionary for a given
> > column?
> > > > > After
> > > > > > > >>> reading
> > > > > > > >>> >> > > the Configuration
> > > > > > > >>> >> > > docs
> > > > > > > >>> >> > > <
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > > > > > > >>> >> > > >,
> > > > > > > >>> >> > > I tried increasing the dictionary page size
> > > configuration
> > > > > 5x,
> > > > > > > >>> with the
> > > > > > > >>> >> > same
> > > > > > > >>> >> > > result (no dictionary created).
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > So to summarize, my questions are:
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary
> writing
> > to
> > > > > > succeed
> > > > > > > >>> for a
> > > > > > > >>> >> > > given column?
> > > > > > > >>> >> > > - Is that heuristic configurable at all?
> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
> > > > > > > frequency-based
> > > > > > > >>> >> > > dictionary encoding been explored? Say, if the data
> > > > follows
> > > > > a
> > > > > > > >>> certain
> > > > > > > >>> >> > > statistical distribution, we can create a dictionary
> > of
> > > > the
> > > > > > most
> > > > > > > >>> >> frequent
> > > > > > > >>> >> > > values only?
> > > > > > > >>> >> > >
> > > > > > > >>> >> > > Thanks for your time!
> > > > > > > >>> >> > > - Claire
> > > > > > > >>> >> > >
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parquet dictionary size limits?

Reply via email to