Claire, thank you for your research and examples on this topic; I've learned a lot. My hunch is that your change would be a good one, but I'm not an expert (and, more to the point, not a committer). I'm looking forward to learning more as this discussion continues.
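To check my own understanding, here is a minimal sketch of the two checks Claire describes below. It is not the parquet-mr source: the class name, the method signatures, the bit-packing estimate, and the `relaxedCheck` method are all hypothetical, and the formulas are only my reading of the thread. The numbers loosely mirror the int-column reproduction (10,000 values in a first page, 4 bytes each) and are purely illustrative.

```java
// Hypothetical, self-contained model of the two checks discussed in this thread.
// NOT the parquet-mr implementation; names and formulas are assumptions.
final class DictionaryFallbackSketch {

    // Size check: fall back once the dictionary outgrows the configured dictionary page size.
    static boolean shouldFallback(long dictionaryByteSize, long maxDictionaryByteSize) {
        return dictionaryByteSize > maxDictionaryByteSize;
    }

    // Effectiveness check: keep the dictionary only if the encoded ids plus the
    // dictionary itself are smaller than the raw (plain) values.
    static boolean isCompressionSatisfying(long rawSize, long encodedSize, long dictionaryByteSize) {
        return encodedSize + dictionaryByteSize < rawSize;
    }

    // One possible reading of Claire's experiment: also accept the dictionary
    // while it is still below the configured maximum size.
    static boolean relaxedCheck(long rawSize, long encodedSize,
                                long dictionaryByteSize, long maxDictionaryByteSize) {
        return isCompressionSatisfying(rawSize, encodedSize, dictionaryByteSize)
            || dictionaryByteSize < maxDictionaryByteSize;
    }

    // Rough size of bit-packed dictionary ids; ignores any RLE/header overhead.
    static long bitPackedBytes(long valueCount, long distinctValues) {
        int bitWidth = distinctValues <= 1 ? 1 : 64 - Long.numberOfLeadingZeros(distinctValues - 1);
        return (valueCount * bitWidth + 7) / 8;
    }

    public static void main(String[] args) {
        long maxDictionaryByteSize = 1024L * 1024L; // 1MB dictionary page size
        long valueCount = 10_000L;
        long rawSize = valueCount * 4L;             // plain-encoded ints

        for (long distinct : new long[] {100L, 10_000L}) {
            long dictionaryByteSize = distinct * 4L;
            long encodedSize = bitPackedBytes(valueCount, distinct);
            System.out.println("distinct=" + distinct
                + " shouldFallback=" + shouldFallback(dictionaryByteSize, maxDictionaryByteSize)
                + " isCompressionSatisfying=" + isCompressionSatisfying(rawSize, encodedSize, dictionaryByteSize)
                + " relaxedCheck=" + relaxedCheck(rawSize, encodedSize, dictionaryByteSize, maxDictionaryByteSize));
        }
    }
}
```

If I have this right, the high-cardinality case fails the effectiveness check (17,500 + 40,000 bytes is not less than 40,000) even though its 40,000-byte dictionary is nowhere near 1MB, which would match what Claire saw in the debugger.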
Thank you again,
Aaron

On Tue, Sep 19, 2023 at 2:48 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

> I created a quick branch
> <https://github.com/apache/parquet-mr/compare/master...clairemcginty:parquet-mr:dict-size-repro?expand=1>
> to reproduce what I'm seeing -- the test shows that an int column with cardinality 100 successfully results in a dict encoding, but an int column with cardinality 10,000 falls back and doesn't create a dict encoding. This seems like a low threshold given the 1MB dictionary page size, so I just wanted to check whether this is expected or not :)
>
> Best,
> Claire
>
> On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:
>
>> Hi, just wanted to follow up on this!
>>
>> I ran a debugger to find out why my column wasn't ending up with a dictionary encoding, and it turns out that even though DictionaryValuesWriter#shouldFallback()
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117>
>> always returned false (dictionaryByteSize was always less than my configured page size), DictionaryValuesWriter#isCompressionSatisfying
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
>> was what was causing Parquet to switch
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L75>
>> back to the fallback, non-dict writer.
>>
>> From what I can tell, this check compares the total byte size of *every* element with the byte size of each *distinct* element as a kind of proxy for encoding efficiency. However, it seems strange that this check can cause the writer to fall back even if the total encoded dict size is far below the configured dictionary page size. Out of curiosity, I modified DictionaryValuesWriter#isCompressionSatisfying
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L125>
>> to also check whether total byte size was less than dictionary max size and re-ran my Parquet write with a local snapshot, and my file size dropped 50%.
>>
>> Best,
>> Claire
>>
>> On Mon, Sep 18, 2023 at 9:16 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:
>>
>>> Oh, interesting! I'm setting it via the ParquetWriter#withDictionaryPageSize method, and I do see the overall file size increasing when I bump the value. I'll look into it a bit more -- it would be helpful for some cases where the number of unique values in a column is just over the size limit.
>>>
>>> - Claire
>>>
>>> On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> I'll note there is also a check for encoding effectiveness [1] that could come into play, but I'd guess that isn't the case here.
>>>>
>>>> [1] https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>>>>
>>>> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>
>>>>>> I'm glad I was looking at the right setting for dictionary size. I just tried it out with 10x, 50x, and even total file size, though, and still am not seeing a dictionary get created. Is it possible it's bounded by file page size or some other layout option that I need to bump as well?
>>>>>
>>>>> Sorry, I'm less familiar with parquet-mr; hopefully someone else can chime in. If I had to guess, maybe somehow the config value isn't making it to the writer (but there could also be something else at play).
>>>>>
>>>>> On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:
>>>>>
>>>>>> Thanks so much, Micah!
>>>>>>
>>>>>>> I think you are using the right setting, but maybe it is possible the strings are still exceeding the threshold (perhaps increasing it by 50x or more to verify)
>>>>>>
>>>>>> I'm glad I was looking at the right setting for dictionary size. I just tried it out with 10x, 50x, and even total file size, though, and still am not seeing a dictionary get created. Is it possible it's bounded by file page size or some other layout option that I need to bump as well?
>>>>>>
>>>>>>> I haven't seen any discussion during my time in the community but maybe it was discussed in the past. I think the main challenge here is that pages are either dictionary encoded or not. I'd guess to make this practical there would need to be a new hybrid page type, which I think might be an interesting idea but quite a bit of work. Additionally, one would likely need heuristics for when to potentially use the new mode versus a complete fallback.
>>>>>>
>>>>>> Got it, thanks for the explanation! It does seem like a huge amount of work.
>>>>>>
>>>>>> Best,
>>>>>> Claire
>>>>>>
>>>>>> On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>>>> - What's the heuristic for Parquet dictionary writing to succeed for a given column?
>>>>>>>
>>>>>>> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
>>>>>>>
>>>>>>>> - Is that heuristic configurable at all?
>>>>>>>
>>>>>>> I think you are using the right setting, but maybe it is possible the strings are still exceeding the threshold (perhaps increasing it by 50x or more to verify)
>>>>>>>
>>>>>>>> - For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?
>>>>>>>
>>>>>>> I haven't seen any discussion during my time in the community but maybe it was discussed in the past. I think the main challenge here is that pages are either dictionary encoded or not. I'd guess to make this practical there would need to be a new hybrid page type, which I think might be an interesting idea but quite a bit of work. Additionally, one would likely need heuristics for when to potentially use the new mode versus a complete fallback.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi dev@,
>>>>>>>>
>>>>>>>> I'm running some benchmarking on Parquet read/write performance and have a few questions about how dictionary encoding works under the hood. Let me know if there's a better channel for this :)
>>>>>>>>
>>>>>>>> My test case uses parquet-avro, where I'm writing a single file containing 5 million records. Each record has a single column, an Avro String field (Parquet binary field). I ran two configurations of base setup: in the first case, the string field has 5,000 possible unique values. In the second case, it has 50,000 unique values.
>>>>>>>>
>>>>>>>> In the first case (5k unique values), I used parquet-tools to inspect the file metadata and found that a dictionary had been written:
>>>>>>>>
>>>>>>>> % parquet-tools meta testdata-case1.parquet
>>>>>>>>> file schema: testdata.TestRecord
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> stringField: REQUIRED BINARY L:STRING R:0 D:0
>>>>>>>>> row group 1: RC:5000001 TS:18262874 OFFSET:4
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
>>>>>>>>
>>>>>>>> But in the second case (50k unique values), parquet-tools shows that no dictionary gets created, and the file size is *much* bigger:
>>>>>>>>
>>>>>>>> % parquet-tools meta testdata-case2.parquet
>>>>>>>>> file schema: testdata.TestRecord
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> stringField: REQUIRED BINARY L:STRING R:0 D:0
>>>>>>>>> row group 1: RC:5000001 TS:18262874 OFFSET:4
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>>>>>>>>
>>>>>>>> (I created a gist of my test reproduction here
>>>>>>>> <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>>>>>>>>
>>>>>>>> Based on this, I'm guessing there's some tip-over point after which Parquet will give up on writing a dictionary for a given column? After reading the Configuration docs
>>>>>>>> <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
>>>>>>>> I tried increasing the dictionary page size configuration 5x, with the same result (no dictionary created).
>>>>>>>>
>>>>>>>> So to summarize, my questions are:
>>>>>>>>
>>>>>>>> - What's the heuristic for Parquet dictionary writing to succeed for a given column?
>>>>>>>> - Is that heuristic configurable at all?
>>>>>>>> - For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, we can create a dictionary of the most frequent values only?
>>>>>>>>
>>>>>>>> Thanks for your time!
>>>>>>>> - Claire

--
Aaron Niskode-Dossett, Data Engineering -- Etsy