I created a quick branch
<https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
to reproduce what I'm seeing -- the test shows that an int column with
cardinality 100 successfully gets a dictionary encoding, but an int column
with cardinality 10,000 falls back and doesn't get one. This seems like a
low threshold given the 1MB dictionary page size, so I just wanted to check
whether this is expected or not :)
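
The test boils down to roughly the following (a simplified sketch rather
than the branch's exact code -- the class/method names and record count here
are illustrative):

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.Encoding;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class DictSizeRepro {
      // Writes records whose int column cycles through `cardinality`
      // distinct values, then reports whether the resulting column chunk
      // ended up dictionary-encoded.
      static boolean usesDictionary(int cardinality) throws IOException {
        Schema schema = SchemaBuilder.record("TestRecord").fields()
            .requiredInt("intField").endRecord();
        Path path = new Path("/tmp/dict-repro-" + cardinality + ".parquet");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(path).withSchema(schema).build()) {
          for (int i = 0; i < 5_000_000; i++) {
            writer.write(new GenericRecordBuilder(schema)
                .set("intField", i % cardinality).build());
          }
        }

        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(path, new Configuration()))) {
          return reader.getFooter().getBlocks().get(0).getColumns().get(0)
              .getEncodings().stream().anyMatch(Encoding::usesDictionary);
        }
      }

      public static void main(String[] args) throws IOException {
        System.out.println(usesDictionary(100));     // true: dictionary kept
        System.out.println(usesDictionary(10_000));  // false: plain encoding
      }
    }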

Best,
Claire

On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> Hi, just wanted to follow up on this!
>
> I ran a debugger to find out why my column wasn't ending up with a
> dictionary encoding, and it turns out that even though
> DictionaryValuesWriter#shouldFallBack()
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
> always returned false (dictionaryByteSize was always less than my
> configured page size), it was DictionaryValuesWriter#isCompressionSatisfying
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> that was causing Parquet to switch
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
> back to the fallback, non-dictionary writer.
>
> From what I can tell, this check compares the raw byte size of *every*
> element against the byte size of the *distinct* elements as a kind of proxy
> for encoding efficiency. However, it seems strange that this check can
> cause the writer to fall back even when the total encoded dictionary size
> is far below the configured dictionary page size. Out of curiosity, I
> modified DictionaryValuesWriter#isCompressionSatisfying
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
> to also check whether the dictionary size was still below the configured
> max dictionary size, re-ran my Parquet write with a local snapshot, and my
> file size dropped by 50%.
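>
> Concretely, the tweak I tried locally was roughly the following (a sketch
> of the idea rather than the exact patch; dictionaryByteSize and
> maxDictionaryByteSize are the writer's existing fields):
>
>     @Override
>     public boolean isCompressionSatisfying(long sizeInBytes, long encodedSize) {
>       // Existing check: keep the dictionary only if the encoded values plus
>       // the dictionary itself come out smaller than the raw data.
>       boolean compresses = (encodedSize + dictionaryByteSize) < sizeInBytes;
>       // Added clause: also keep the dictionary as long as it still fits
>       // under the configured max dictionary page size.
>       return compresses || dictionaryByteSize < maxDictionaryByteSize;
>     }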
>
> Best,
> Claire
>
> On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> wrote:
>
>> Oh, interesting! I'm setting it via the
>> ParquetWriter#withDictionaryPageSize method, and I do see the overall file
>> size increasing when I bump the value. I'll look into it a bit more -- it
>> would be helpful for cases where the number of unique values in a column
>> is just over the size limit.
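>>
>> For reference, the writer setup looks roughly like this (a sketch --
>> `path` and `schema` are placeholders, and the bumped size is just an
>> example):
>>
>>     ParquetWriter<GenericRecord> writer = AvroParquetWriter
>>         .<GenericRecord>builder(path)
>>         .withSchema(schema)
>>         .withDictionaryEncoding(true)
>>         // default dictionary page size is 1MB; bump it 5x here
>>         .withDictionaryPageSize(5 * 1024 * 1024)
>>         .build();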
>>
>> - Claire
>>
>> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> I'll note there is also a check for encoding effectiveness [1] that could
>>> come into play, but I'd guess that isn't the case here.
>>>
>>> [1]
>>>
>>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>>>
>>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>> >> I'm glad I was looking at the right setting for dictionary size. I
>>> >> just tried it out with 10x, 50x, and even total file size, though,
>>> >> and still am not seeing a dictionary get created. Is it possible it's
>>> >> bounded by file page size or some other layout option that I need to
>>> >> bump as well?
>>> >
>>> >
>>> > Sorry, I'm less familiar with parquet-mr; hopefully someone else can
>>> > chime in.  If I had to guess, maybe somehow the config value isn't
>>> > making it to the writer (but there could also be something else at
>>> > play).
>>> >
>>> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
>>> > wrote:
>>> >
>>> >> Thanks so much, Micah!
>>> >>
>>> >> > I think you are using the right setting, but maybe it is possible
>>> >> > the strings are still exceeding the threshold (perhaps increasing
>>> >> > it by 50x or more to verify)
>>> >>
>>> >>
>>> >> I'm glad I was looking at the right setting for dictionary size. I
>>> >> just tried it out with 10x, 50x, and even total file size, though,
>>> >> and still am not seeing a dictionary get created. Is it possible it's
>>> >> bounded by file page size or some other layout option that I need to
>>> >> bump as well?
>>> >>
>>> >> > I haven't seen this discussed during my time in the community, but
>>> >> > maybe it was discussed in the past.  I think the main challenge here
>>> >> > is that pages are either dictionary encoded or not.  I'd guess to
>>> >> > make this practical there would need to be a new hybrid page type,
>>> >> > which I think might be an interesting idea but quite a bit of work.
>>> >> > Additionally, one would likely need heuristics for when to
>>> >> > potentially use the new mode versus a complete fallback.
>>> >> >
>>> >>
>>> >> Got it, thanks for the explanation! It does seem like a huge amount
>>> >> of work.
>>> >>
>>> >>
>>> >> Best,
>>> >> Claire
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > >
>>> >> > > - What's the heuristic for Parquet dictionary writing to succeed
>>> >> > > for a given column?
>>> >> >
>>> >> > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>>> >> >
>>> >> >
>>> >> > > - Is that heuristic configurable at all?
>>> >> >
>>> >> >
>>> >> > I think you are using the right setting, but maybe it is possible
>>> >> > the strings are still exceeding the threshold (perhaps increasing
>>> >> > it by 50x or more to verify)
>>> >> >
>>> >> >
>>> >> > > - For high-cardinality datasets, has the idea of a frequency-based
>>> >> > > dictionary encoding been explored? Say, if the data follows a
>>> >> > > certain statistical distribution, we can create a dictionary of
>>> >> > > the most frequent values only?
>>> >> >
>>> >> > I haven't seen this discussed during my time in the community, but
>>> >> > maybe it was discussed in the past.  I think the main challenge here
>>> >> > is that pages are either dictionary encoded or not.  I'd guess to
>>> >> > make this practical there would need to be a new hybrid page type,
>>> >> > which I think might be an interesting idea but quite a bit of work.
>>> >> > Additionally, one would likely need heuristics for when to
>>> >> > potentially use the new mode versus a complete fallback.
>>> >> >
>>> >> > Cheers,
>>> >> > Micah
>>> >> >
>>> >> > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> > > Hi dev@,
>>> >> > >
>>> >> > > I'm running some benchmarking on Parquet read/write performance
>>> >> > > and have a few questions about how dictionary encoding works under
>>> >> > > the hood. Let me know if there's a better channel for this :)
>>> >> > >
>>> >> > > My test case uses parquet-avro, where I'm writing a single file
>>> >> > > containing 5 million records. Each record has a single column, an
>>> >> > > Avro String field (Parquet binary field). I ran two configurations
>>> >> > > of the base setup: in the first case, the string field has 5,000
>>> >> > > possible unique values; in the second case, it has 50,000 unique
>>> >> > > values.
>>> >> > >
>>> >> > > In the first case (5k unique values), I used parquet-tools to
>>> >> > > inspect the file metadata and found that a dictionary had been
>>> >> > > written:
>>> >> > >
>>> >> > > % parquet-tools meta testdata-case1.parquet
>>> >> > > > file schema:  testdata.TestRecord
>>> >> > > > --------------------------------------------------------------------------------
>>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>>> >> > > > --------------------------------------------------------------------------------
>>> >> > > > stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
>>> >> > > > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>>> >> > >
>>> >> > >
>>> >> > > But in the second case (50k unique values), parquet-tools shows
>>> >> > > that no dictionary gets created, and the file size is *much*
>>> >> > > bigger:
>>> >> > >
>>> >> > > % parquet-tools meta testdata-case2.parquet
>>> >> > > > file schema:  testdata.TestRecord
>>> >> > > > --------------------------------------------------------------------------------
>>> >> > > > stringField:  REQUIRED BINARY L:STRING R:0 D:0
>>> >> > > > row group 1:  RC:5000001 TS:18262874 OFFSET:4
>>> >> > > > --------------------------------------------------------------------------------
>>> >> > > > stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
>>> >> > > > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>>> >> > >
>>> >> > >
>>> >> > > (I created a gist of my test reproduction here:
>>> >> > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>>> >> > >
>>> >> > > Based on this, I'm guessing there's some tip-over point after
>>> >> > > which Parquet will give up on writing a dictionary for a given
>>> >> > > column? After reading the Configuration docs
>>> >> > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
>>> >> > > I tried increasing the dictionary page size configuration 5x, with
>>> >> > > the same result (no dictionary created).
>>> >> > >
>>> >> > > So to summarize, my questions are:
>>> >> > >
>>> >> > > - What's the heuristic for Parquet dictionary writing to succeed
>>> >> > > for a given column?
>>> >> > > - Is that heuristic configurable at all?
>>> >> > > - For high-cardinality datasets, has the idea of a frequency-based
>>> >> > > dictionary encoding been explored? Say, if the data follows a
>>> >> > > certain statistical distribution, we can create a dictionary of
>>> >> > > the most frequent values only?
>>> >> > >
>>> >> > > Thanks for your time!
>>> >> > > - Claire
>>> >> > >
>>> >> >
>>> >>
>>> >
>>>
>>
