Hi, just wanted to follow up on this!

I ran a debugger to find out why my column wasn't ending up with a
dictionary encoding and it turns out that even though
DictionaryValuesWriter#shouldFallback()
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
always returned false (dictionaryByteSize was always less than my
configured page size), DictionaryValuesWriter#isCompressionSatisfying
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
was
what was causing Parquet to switch
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
back to the fallback, non-dict writer.
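
To make that concrete, here's roughly how I understand the two checks (a
sketch with approximated field/method names, not the actual parquet-mr
source):

```java
// Rough sketch of the two checks described above -- NOT the actual
// parquet-mr source; names only approximate DictionaryValuesWriter's.
class DictionaryHeuristicSketch {
    long dictionaryByteSize;   // bytes used by the distinct values seen so far
    int maxDictionaryByteSize; // configured dictionary page size

    // Checked as values are written: only trips once the dictionary itself
    // outgrows the configured page size. In my run this always returned false.
    boolean shouldFallBack() {
        return dictionaryByteSize > maxDictionaryByteSize;
    }

    // Checked when the first page is flushed: dictionary encoding is only
    // "worth it" if the dictionary plus the encoded values beat the raw
    // (PLAIN) size. This is the check that sent my column back to the
    // plain writer.
    boolean isCompressionSatisfying(long rawSize, long encodedSize) {
        return (encodedSize + dictionaryByteSize) < rawSize;
    }
}
```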

From what I can tell, this check compares the total byte size of
*every* element with the byte size of each *distinct* element as a kind of
proxy for encoding efficiency. However, it seems strange that this check
can cause the writer to fall back even if the total encoded dictionary size is
far below the configured dictionary page size. Out of curiosity, I modified
DictionaryValuesWriter#isCompressionSatisfying
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
to also check whether the total dictionary byte size was less than the
configured dictionary max size, re-ran my Parquet write with a local snapshot,
and my file size dropped by 50%.
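
For reference, the tweak looked roughly like this (a sketch, not the exact
patch; `dictionaryByteSize` and `maxDictionaryByteSize` stand in for the
writer's internal state and the configured dictionary page size):

```java
// Sketch of the local modification described above -- not the exact patch,
// and field names only approximate DictionaryValuesWriter's.
class ModifiedDictionarySketch {
    long dictionaryByteSize;   // bytes used by the distinct values
    int maxDictionaryByteSize; // configured dictionary page size

    boolean isCompressionSatisfying(long rawSize, long encodedSize) {
        // Original idea: dictionary + encoded values must beat the raw size...
        return (encodedSize + dictionaryByteSize) < rawSize
            // ...plus: keep the dictionary while it still fits under the max.
            || dictionaryByteSize < maxDictionaryByteSize;
    }
}
```

With this change, a column whose dictionary comfortably fits the configured
page size keeps its dictionary even when the first page's compression-ratio
check fails.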

Best,
Claire

On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> Oh, interesting! I'm setting it via the
> ParquetWriter#withDictionaryPageSize method, and I do see the overall file
> size increasing when I bump the value. I'll look into it a bit more -- it
> would be helpful for some cases where the # unique values in a column is
> just over the size limit.
>
> - Claire
>
> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> I'll note there is also a check for encoding effectiveness [1] that could
>> come into play but I'd guess that isn't the case here.
>>
>> [1]
>>
>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>>
>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>> >> I'm glad I was looking at the right setting for dictionary size. I just
>> >> tried it out with 10x, 50x, and even total file size, though, and still am
>> >> not seeing a dictionary get created. Is it possible it's bounded by file
>> >> page size or some other layout option that I need to bump as well?
>> >
>> >
>> > Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime
>> > in.  If I had to guess, maybe somehow the config value isn't making it to
>> > the writer (but there could also be something else at play).
>> >
>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
>> > wrote:
>> >
>> >> Thanks so much, Micah!
>> >>
>> >> > I think you are using the right setting, but maybe it is possible the
>> >> > strings are still exceeding the threshold (perhaps increasing it by 50x
>> >> > or more to verify)
>> >>
>> >>
>> >> I'm glad I was looking at the right setting for dictionary size. I just
>> >> tried it out with 10x, 50x, and even total file size, though, and still am
>> >> not seeing a dictionary get created. Is it possible it's bounded by file
>> >> page size or some other layout option that I need to bump as well?
>> >>
>> >> > I haven't seen this discussion during my time in the community but
>> >> > maybe it was discussed in the past.  I think the main challenge here is
>> >> > that pages are either dictionary encoded or not.  I'd guess to make this
>> >> > practical there would need to be a new hybrid page type, which I think
>> >> > might be an interesting idea but quite a bit of work.  Additionally, one
>> >> > would likely need heuristics for when to potentially use the new mode
>> >> > versus a complete fallback.
>> >> >
>> >>
>> >> Got it, thanks for the explanation! It does seem like a huge amount of
>> >> work.
>> >>
>> >>
>> >> Best,
>> >> Claire
>> >>
>> >>
>> >>
>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
>> >> wrote:
>> >>
>> >> > >
>> >> > > - What's the heuristic for Parquet dictionary writing to succeed for a
>> >> > > given column?
>> >> >
>> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>> >> >
>> >> >
>> >> > > - Is that heuristic configurable at all?
>> >> >
>> >> >
>> >> > I think you are using the right setting, but maybe it is possible the
>> >> > strings are still exceeding the threshold (perhaps increasing it by 50x
>> >> > or more to verify)
>> >> >
>> >> >
>> >> > > - For high-cardinality datasets, has the idea of a frequency-based
>> >> > > dictionary encoding been explored? Say, if the data follows a certain
>> >> > > statistical distribution, we can create a dictionary of the most
>> >> > > frequent values only?
>> >> >
>> >> > I haven't seen this discussion during my time in the community but
>> >> > maybe it was discussed in the past.  I think the main challenge here is
>> >> > that pages are either dictionary encoded or not.  I'd guess to make this
>> >> > practical there would need to be a new hybrid page type, which I think
>> >> > might be an interesting idea but quite a bit of work.  Additionally, one
>> >> > would likely need heuristics for when to potentially use the new mode
>> >> > versus a complete fallback.
>> >> >
>> >> > Cheers,
>> >> > Micah
>> >> >
>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Hi dev@,
>> >> > >
>> >> > > I'm running some benchmarking on Parquet read/write performance and
>> >> > > have a few questions about how dictionary encoding works under the
>> >> > > hood. Let me know if there's a better channel for this :)
>> >> > >
>> >> > > My test case uses parquet-avro, where I'm writing a single file
>> >> > > containing 5 million records. Each record has a single column, an Avro
>> >> > > String field (Parquet binary field). I ran two configurations of base
>> >> > > setup: in the first case, the string field has 5,000 possible unique
>> >> > > values. In the second case, it has 50,000 unique values.
>> >> > >
>> >> > > In the first case (5k unique values), I used parquet-tools to inspect
>> >> > > the file metadata and found that a dictionary had been written:
>> >> > >
>> >> > > % parquet-tools meta testdata-case1.parquet
>> >> > > > file schema:  testdata.TestRecord
>> >> > > > --------------------------------------------------------------------------------
>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> >> > > > --------------------------------------------------------------------------------
>> >> > > > stringField:  BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999,
>> >> > > > num_nulls: 0]
>> >> > >
>> >> > >
>> >> > > But in the second case (50k unique values), parquet-tools shows that
>> >> > > no dictionary gets created, and the file size is *much* bigger:
>> >> > >
>> >> > > % parquet-tools meta testdata-case2.parquet
>> >> > > > file schema:  testdata.TestRecord
>> >> > > > --------------------------------------------------------------------------------
>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>> >> > > > --------------------------------------------------------------------------------
>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>> >> > >
>> >> > >
>> >> > > (I created a gist of my test reproduction here
>> >> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>> >> > >
>> >> > > Based on this, I'm guessing there's some tip-over point after which
>> >> > > Parquet will give up on writing a dictionary for a given column? After
>> >> > > reading the Configuration docs
>> >> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
>> >> > > I tried increasing the dictionary page size configuration 5x, with the
>> >> > > same result (no dictionary created).
>> >> > >
>> >> > > So to summarize, my questions are:
>> >> > >
>> >> > > - What's the heuristic for Parquet dictionary writing to succeed for a
>> >> > > given column?
>> >> > > - Is that heuristic configurable at all?
>> >> > > - For high-cardinality datasets, has the idea of a frequency-based
>> >> > > dictionary encoding been explored? Say, if the data follows a certain
>> >> > > statistical distribution, we can create a dictionary of the most
>> >> > > frequent values only?
>> >> > >
>> >> > > Thanks for your time!
>> >> > > - Claire
>> >> > >
>> >> >
>> >>
>> >
>>
>
