Hi all,

Just to follow up, I ran some benchmarks with an added Configuration option
to set a desired "compression ratio" param, which you can see here
<https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#max-dictionary-compression-ratio-option>,
on a variety of data layouts (distribution, sorting, cardinality etc). I
also have a table of comparisons
<https://github.com/clairemcginty/parquet-benchmarks/blob/main/write_results.md#overall-comparison>
using
the latest 0.14.0-SNAPSHOT as a baseline. These are my takeaways:

   - The compression ratio param doesn't benefit data that's in sorted
   order (IMO, because even without the addition of this param, sorted columns
   are more likely to produce efficient dict encodings).
   - On shuffled (non-sorted) data, setting the compression ratio param
   produces a much better *uncompressed* file result (in one case, 39MB vs.
   the 102MB baseline). However, after applying a file-level compression
   algorithm such as ZSTD, the baseline and ratio-param results end up
   pretty much equal (within a 5% margin). I think this makes sense: since in
   Parquet 1.0 dictionaries are not encoded, ZSTD must be better at
   compressing many repeated column values across a file than it is at
   compressing dictionaries across all pages.
   - The compression ratio param works best with larger page sizes (10 MB
   or 50 MB) combined with large-ish dictionary page sizes (10 MB); a rough
   sketch of the writer configuration follows below.
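
For reference, here's roughly how a benchmark writer can be configured (a
sketch, not the exact harness code: the page-size builder methods are real
parquet-mr APIs, but the "parquet.dictionary.compression.ratio" key is
hypothetical -- it only exists in my experimental branch and the exact name
may differ -- and `schema` is the Avro schema under test):

  Configuration conf = new Configuration();
  // Hypothetical option from my experimental branch, not in parquet-mr proper:
  conf.setDouble("parquet.dictionary.compression.ratio", 0.1);

  ParquetWriter<GenericRecord> writer = AvroParquetWriter
      .<GenericRecord>builder(new Path("/tmp/benchmark.parquet"))
      .withSchema(schema)                        // Avro schema under test
      .withConf(conf)
      .withPageSize(10 * 1024 * 1024)            // 10 MB data pages
      .withDictionaryPageSize(10 * 1024 * 1024)  // 10 MB dictionary pages
      .withCompressionCodec(CompressionCodecName.ZSTD)
      .build();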

Overall, I think the equalizing behavior of file-level compression (ZSTD)
makes it not worthwhile to add a configuration option for dictionary
compression :) Thanks for all of your input on this -- if nothing else, the
benchmarks are a really interesting look at how important data layout is to
overall file size!

Best,
Claire

On Thu, Sep 21, 2023 at 2:12 PM Claire McGinty <[email protected]>
wrote:

> Hmm, like a flag to basically turn off the isCompressionSatisfying check
> per-column? That might be simplest!
>
> So to summarize, a column will not write a dictionary encoding when either:
>
> (1) `parquet.enable.dictionary` is set to False
> (2) # of distinct values in a column chunk exceeds
> `DictionaryValuesWriter#MAX_DICTIONARY_VALUES` (currently set to
> Integer.MAX_VALUE)
> (3) Total encoded bytes in a dictionary exceed the value of
> `parquet.dictionary.page.size`
> (4) Desired compression ratio (as a measure of # distinct values : total #
> values) is not achieved (see the rough sketch after this list)
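>
> Putting those together, the decision per column chunk is roughly (a
> paraphrased sketch, not the exact parquet-mr code; dictionaryEnabled,
> distinctValues, and dictPageSizeThreshold are illustrative names):
>
>   // Keep the dictionary encoding only if all four conditions hold:
>   boolean keepDictionary =
>       dictionaryEnabled                                  // (1)
>       && distinctValues <= MAX_DICTIONARY_VALUES         // (2)
>       && dictionaryByteSize <= dictPageSizeThreshold     // (3)
>       && (encodedSize + dictionaryByteSize) < rawSize;   // (4)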
>
> I might try out various options for making (4) configurable, starting with
> your suggestion, and testing them out on more realistic data distributions.
> Will try to return to this thread with my results in a few days :)
>
> Best,
> Claire
>
>
> On Thu, Sep 21, 2023 at 8:48 AM Gang Wu <[email protected]> wrote:
>
>> The current implementation only checks the first page, which is
>> fragile in many cases. I think your suggestion makes sense.
>> However, there is no one-size-fits-all solution. How about simply
>> adding a flag to enforce dictionary encoding for a specific column?
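>>
>> For reference -- if I remember the builder API correctly, recent parquet-mr
>> versions already let you enable dictionary encoding per column, so the new
>> flag would be about skipping the isCompressionSatisfying fallback rather
>> than enabling the attempt. Something like (the second call is purely
>> hypothetical):
>>
>>   // Exists today (column-wise configuration): attempt dictionary encoding.
>>   builder.withDictionaryEncoding("my_column", true);
>>
>>   // Hypothetical new flag: keep the dictionary even when
>>   // isCompressionSatisfying would otherwise fall back.
>>   builder.withForcedDictionaryEncoding("my_column");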
>>
>>
>> On Thu, Sep 21, 2023 at 1:08 AM Claire McGinty <
>> [email protected]>
>> wrote:
>>
>> > I think I figured it out! The dictionaryByteSize == 0 was a red herring; I
>> > was looking at an IntegerDictionaryValuesWriter for an empty column rather
>> > than my high-cardinality column. Your analysis of the situation was
>> > right -- it was just that the first page didn't have enough repeated
>> > values to pass the check.
>> >
>> > I wonder if we could maybe make this check configurable per-column? Either:
>> >
>> > - A desired ratio of distinct values / total values, on a scale of 0-1.0
>> >   (a rough sketch of this option follows below)
>> > - Number of pages to check for compression before falling back
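>> >
>> > For the first option, the check could look something like this (a rough
>> > sketch only -- maxDistinctRatio and the value counters are hypothetical
>> > names, not existing parquet-mr fields):
>> >
>> >   @Override
>> >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>> >     // Fall back once the column looks "too distinct" for a dictionary,
>> >     // per a user-configured threshold on a 0-1.0 scale.
>> >     return totalValueCount > 0
>> >         && ((double) distinctValueCount / totalValueCount) <= maxDistinctRatio;
>> >   }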
>> >
>> > Let me know what you think!
>> >
>> > Best,
>> > Claire
>> >
>> > On Wed, Sep 20, 2023 at 9:37 AM Gang Wu <[email protected]> wrote:
>> >
>> > > I don't understand why you get encodedSize == 1, dictionaryByteSize == 0
>> > > and rawSize == 0 in the first page check. It seems that the page does not
>> > > have any meaningful values. Could you please check how many values are
>> > > written before the page check?
>> > >
>> > > On Thu, Sep 21, 2023 at 12:12 AM Claire McGinty <
>> > > [email protected]>
>> > > wrote:
>> > >
>> > > > Hey Gang,
>> > > >
>> > > > Thanks for the follow-up! I see what you're saying: it's sometimes just
>> > > > bad luck with what ends up in the first page. Intuitively, a larger page
>> > > > size seems like it should produce a better encoding in this case... I
>> > > > updated my branch
>> > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
>> > > > to
>> > > > add a test with a page size/dict page size of 10MB and am seeing the
>> > > > same failure, though.
>> > > >
>> > > > Something seems kind of odd actually -- when I stepped through the test
>> > > > I added w/ debugger, it falls back after invoking isCompressionSatisfying
>> > > > with encodedSize == 1, dictionaryByteSize == 0 and rawSize == 0; 1 + 0 < 1
>> > > > returns true. (You can also see this in the System.out logs I added, in
>> > > > the branch's GHA run logs.) This doesn't seem right to me -- does
>> > > > isCompressionSatisfying need an extra check to make sure the
>> > > > dictionary isn't empty?
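>> > > >
>> > > > Concretely, I'm imagining something like this (just a sketch of the
>> > > > extra guard, untested):
>> > > >
>> > > >   @Override
>> > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>> > > >     // Guard: an empty dictionary shouldn't count as "satisfying",
>> > > >     // since the size comparison can pass vacuously before any
>> > > >     // values have been dictionary-encoded.
>> > > >     return dictionaryByteSize > 0
>> > > >         && (encodedSize + dictionaryByteSize) < rawSize;
>> > > >   }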
>> > > >
>> > > > Also, thanks, Aaron! I got into this while running some micro-benchmarks
>> > > > on Parquet reads when various dictionary/bloom filter/encoding options
>> > > > are configured. Happy to share out when I'm done.
>> > > >
>> > > > Best,
>> > > > Claire
>> > > >
>> > > > On Tue, Sep 19, 2023 at 9:06 PM Gang Wu <[email protected]> wrote:
>> > > >
>> > > > > Thanks for the investigation!
>> > > > >
>> > > > > I think the check below makes sense for a single page:
>> > > > >
>> > > > >   @Override
>> > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>> > > > >     return (encodedSize + dictionaryByteSize) < rawSize;
>> > > > >   }
>> > > > >
>> > > > > The problem is that the fallback check is only performed on the first
>> > > > > page. In the first page, all values in that page may be distinct, so it
>> > > > > is unlikely to pass the isCompressionSatisfying check.
>> > > > >
>> > > > > Best,
>> > > > > Gang
>> > > > >
>> > > > >
>> > > > > On Wed, Sep 20, 2023 at 5:04 AM Aaron Niskode-Dossett
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > Claire, thank you for your research and examples on this topic, I've
>> > > > > > learned a lot.  My hunch is that your change would be a good one, but
>> > > > > > I'm not an expert (and more to the point, not a committer).  I'm
>> > > > > > looking forward to learning more as this discussion continues.
>> > > > > >
>> > > > > > Thank you again, Aaron
>> > > > > >
>> > > > > > On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <[email protected]> wrote:
>> > > > > >
>> > > > > > > I created a quick branch
>> > > > > > > <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
>> > > > > > > to reproduce what I'm seeing -- the test shows that an int column
>> > > > > > > with cardinality 100 successfully results in a dict encoding, but an
>> > > > > > > int column with cardinality 10,000 falls back and doesn't create a
>> > > > > > > dict encoding. This seems like a low threshold given the 1MB
>> > > > > > > dictionary page size, so I just wanted to check whether this is
>> > > > > > > expected or not :)
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Claire
>> > > > > > >
>> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hi, just wanted to follow up on this!
>> > > > > > > >
>> > > > > > > > I ran a debugger to find out why my column wasn't ending up with a
>> > > > > > > > dictionary encoding, and it turns out that even though
>> > > > > > > > DictionaryValuesWriter#shouldFallback()
>> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
>> > > > > > > > always returned false (dictionaryByteSize was always less than my
>> > > > > > > > configured page size),
>> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
>> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
>> > > > > > > > was
>> > > > > > > > what was causing Parquet to switch
>> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
>> > > > > > > > back to the fallback, non-dict writer.
>> > > > > > > >
>> > > > > > > > From what I can tell, this check compares the total byte size of
>> > > > > > > > *every* element with the byte size of each *distinct* element as a
>> > > > > > > > kind of proxy for encoding efficiency... however, it seems strange
>> > > > > > > > that this check can cause the writer to fall back even if the total
>> > > > > > > > encoded dict size is far below the configured dictionary page size.
>> > > > > > > > Out of curiosity, I modified
>> > > > > > > > DictionaryValuesWriter#isCompressionSatisfying
>> > > > > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
>> > > > > > > > to
>> > > > > > > > also check whether total byte size was less than the dictionary
>> > > > > > > > max size, re-ran my Parquet write with a local snapshot, and my
>> > > > > > > > file size dropped 50%.
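>> > > > > > > >
>> > > > > > > > (Roughly, the local change looked like this -- paraphrased from
>> > > > > > > > memory rather than the exact diff, using the writer's existing
>> > > > > > > > maxDictionaryByteSize field, if I'm remembering the name right:)
>> > > > > > > >
>> > > > > > > >   @Override
>> > > > > > > >   public boolean isCompressionSatisfying(long rawSize, long encodedSize) {
>> > > > > > > >     // Also accept the dictionary while it still fits under the
>> > > > > > > >     // configured dictionary page size.
>> > > > > > > >     return (encodedSize + dictionaryByteSize) < rawSize
>> > > > > > > >         || dictionaryByteSize < maxDictionaryByteSize;
>> > > > > > > >   }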
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Claire
>> > > > > > > >
>> > > > > > > > On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <
>> > > > > > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > >> Oh, interesting! I'm setting it via the
>> > > > > > > >> ParquetWriter#withDictionaryPageSize method, and I do see the
>> > > > > > > >> overall file size increasing when I bump the value. I'll look into
>> > > > > > > >> it a bit more -- it would be helpful for some cases where the #
>> > > > > > > >> unique values in a column is just over the size limit.
>> > > > > > > >>
>> > > > > > > >> - Claire
>> > > > > > > >>
>> > > > > > > >> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <
>> > > > > > [email protected]>
>> > > > > > > >> wrote:
>> > > > > > > >>
>> > > > > > > >>> I'll note there is also a check for encoding effectiveness [1]
>> > > > > > > >>> that could come into play, but I'd guess that isn't the case here.
>> > > > > > > >>>
>> > > > > > > >>> [1]
>> > > > > > > >>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>> > > > > > > >>>
>> > > > > > > >>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <[email protected]> wrote:
>> > > > > > > >>>
>> > > > > > > >>> >> I'm glad I was looking at the right setting for dictionary
>> > > > > > > >>> >> size. I just tried it out with 10x, 50x, and even total file
>> > > > > > > >>> >> size, though, and still am not seeing a dictionary get created.
>> > > > > > > >>> >> Is it possible it's bounded by file page size or some other
>> > > > > > > >>> >> layout option that I need to bump as well?
>> > > > > > > >>> >
>> > > > > > > >>> >
>> > > > > > > >>> > Sorry, I'm less familiar with parquet-mr; hopefully someone
>> > > > > > > >>> > else can chime in.  If I had to guess, maybe somehow the config
>> > > > > > > >>> > value isn't making it to the writer (but there could also be
>> > > > > > > >>> > something else at play).
>> > > > > > > >>> >
>> > > > > > > >>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <
>> > > > > > > >>> [email protected]>
>> > > > > > > >>> > wrote:
>> > > > > > > >>> >
>> > > > > > > >>> >> Thanks so much, Micah!
>> > > > > > > >>> >>
>> > > > > > > >>> >> > I think you are using the right setting, but maybe it is
>> > > > > > > >>> >> > possible the strings are still exceeding the threshold
>> > > > > > > >>> >> > (perhaps increasing it by 50x or more to verify)
>> > > > > > > >>> >>
>> > > > > > > >>> >>
>> > > > > > > >>> >> I'm glad I was looking at the right setting for dictionary
>> > > > > > > >>> >> size. I just tried it out with 10x, 50x, and even total file
>> > > > > > > >>> >> size, though, and still am not seeing a dictionary get created.
>> > > > > > > >>> >> Is it possible it's bounded by file page size or some other
>> > > > > > > >>> >> layout option that I need to bump as well?
>> > > > > > > >>> >>
>> > > > > > > >>> >> > I haven't seen this discussion during my time in the
>> > > > > > > >>> >> > community, but maybe it was discussed in the past.  I think
>> > > > > > > >>> >> > the main challenge here is that pages are either dictionary
>> > > > > > > >>> >> > encoded or not.  I'd guess to make this practical there would
>> > > > > > > >>> >> > need to be a new hybrid page type, which I think might be an
>> > > > > > > >>> >> > interesting idea but quite a bit of work.  Additionally, one
>> > > > > > > >>> >> > would likely need heuristics for when to potentially use the
>> > > > > > > >>> >> > new mode versus a complete fallback.
>> > > > > > > >>> >> >
>> > > > > > > >>> >>
>> > > > > > > >>> >> Got it, thanks for the explanation! It does seem like a huge
>> > > > > > > >>> >> amount of work.
>> > > > > > > >>> >>
>> > > > > > > >>> >>
>> > > > > > > >>> >> Best,
>> > > > > > > >>> >> Claire
>> > > > > > > >>> >>
>> > > > > > > >>> >>
>> > > > > > > >>> >>
>> > > > > > > >>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <
>> > > > > > > >>> [email protected]>
>> > > > > > > >>> >> wrote:
>> > > > > > > >>> >>
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
>> > > > > > > >>> >> > > succeed for a given column?
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > > - Is that heuristic configurable at all?
>> > > > > > > >>> >> >
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > I think you are using the right setting, but maybe it is
>> > > > > > > >>> >> > possible the strings are still exceeding the threshold
>> > > > > > > >>> >> > (perhaps increasing it by 50x or more to verify)
>> > > > > > > >>> >> >
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
>> > > > > > > >>> >> > > frequency-based dictionary encoding been explored? Say, if
>> > > > > > > >>> >> > > the data follows a certain statistical distribution, we can
>> > > > > > > >>> >> > > create a dictionary of the most frequent values only?
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > I haven't seen this discussion during my time in the
>> > > > > > > >>> >> > community, but maybe it was discussed in the past.  I think
>> > > > > > > >>> >> > the main challenge here is that pages are either dictionary
>> > > > > > > >>> >> > encoded or not.  I'd guess to make this practical there would
>> > > > > > > >>> >> > need to be a new hybrid page type, which I think might be an
>> > > > > > > >>> >> > interesting idea but quite a bit of work.  Additionally, one
>> > > > > > > >>> >> > would likely need heuristics for when to potentially use the
>> > > > > > > >>> >> > new mode versus a complete fallback.
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > Cheers,
>> > > > > > > >>> >> > Micah
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <
>> > > > > > > >>> >> > [email protected]>
>> > > > > > > >>> >> > wrote:
>> > > > > > > >>> >> >
>> > > > > > > >>> >> > > Hi dev@,
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > I'm running some benchmarking on Parquet read/write
>> > > > > > > >>> >> > > performance and have a few questions about how dictionary
>> > > > > > > >>> >> > > encoding works under the hood. Let me know if there's a
>> > > > > > > >>> >> > > better channel for this :)
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > My test case uses parquet-avro, where I'm writing a single
>> > > > > > > >>> >> > > file containing 5 million records. Each record has a single
>> > > > > > > >>> >> > > column, an Avro String field (Parquet binary field). I ran
>> > > > > > > >>> >> > > two configurations of the base setup: in the first case,
>> > > > > > > >>> >> > > the string field has 5,000 possible unique values. In the
>> > > > > > > >>> >> > > second case, it has 50,000 unique values.
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > In the first case (5k unique values), I used parquet-tools
>> > > > > > > >>> >> > > to inspect the file metadata and found that a dictionary
>> > > > > > > >>> >> > > had been written:
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > % parquet-tools meta testdata-case1.parquet
>> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
>> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
>> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > > > > > >>> >> > > >
>> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
>> > > > > > > >>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > But in the second case (50k unique values), parquet-tools
>> > > > > > > >>> >> > > shows that no dictionary gets created, and the file size is
>> > > > > > > >>> >> > > *much* bigger:
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > % parquet-tools meta testdata-case2.parquet
>> > > > > > > >>> >> > > > file schema:  testdata.TestRecord
>> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
>> > > > > > > >>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> > > > > > > >>> >> > > >
>> > > > > > > >>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> > > > > > > >>> >> > > > --------------------------------------------------------------------------------
>> > > > > > > >>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > (I created a gist of my test reproduction here
>> > > > > > > >>> >> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > Based on this, I'm guessing there's some tip-over point
>> > > > > > > >>> >> > > after which Parquet will give up on writing a dictionary
>> > > > > > > >>> >> > > for a given column? After reading the Configuration docs
>> > > > > > > >>> >> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
>> > > > > > > >>> >> > > I tried increasing the dictionary page size configuration
>> > > > > > > >>> >> > > 5x, with the same result (no dictionary created).
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > So to summarize, my questions are:
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > - What's the heuristic for Parquet dictionary writing to
>> > > > > > > >>> >> > > succeed for a given column?
>> > > > > > > >>> >> > > - Is that heuristic configurable at all?
>> > > > > > > >>> >> > > - For high-cardinality datasets, has the idea of a
>> > > > > > > >>> >> > > frequency-based dictionary encoding been explored? Say, if
>> > > > > > > >>> >> > > the data follows a certain statistical distribution, we can
>> > > > > > > >>> >> > > create a dictionary of the most frequent values only?
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> > > Thanks for your time!
>> > > > > > > >>> >> > > - Claire
>> > > > > > > >>> >> > >
>> > > > > > > >>> >> >
>> > > > > > > >>> >>
>> > > > > > > >>> >
>> > > > > > > >>>
>> > > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Aaron Niskode-Dossett, Data Engineering -- Etsy
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
