Oh, interesting! I'm setting it via the ParquetWriter#withDictionaryPageSize method, and I do see the overall file size increasing when I bump the value. I'll look into it a bit more -- it would be helpful for some cases where the number of unique values in a column is just over the size limit.
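
For reference, here's roughly how I'm constructing the writer (a minimal sketch rather than my exact benchmark code -- the output path and record loop are stand-ins, and the schema mirrors the single-string-field TestRecord from my gist):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictionaryPageSizeRepro {
  public static void main(String[] args) throws IOException {
    // Single required string column, mirroring the benchmark schema.
    Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
        .fields().requiredString("stringField").endRecord();

    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case2.parquet"))
            .withSchema(schema)
            .withDictionaryEncoding(true)
            .withDictionaryPageSize(10 * 1024 * 1024) // 10x the 1 MB default
            .build()) {
      // 5M+ records cycling through 50k distinct values (case 2).
      for (int i = 0; i < 5_000_001; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("stringField", Integer.toString(i % 50_000));
        writer.write(record);
      }
    }
  }
}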
- Claire

On Fri, Sep 15, 2023 at 9:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I'll note there is also a check for encoding effectiveness [1] that could
> come into play, but I'd guess that isn't the case here.
>
> [1]
> https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L124
>
> On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > > I'm glad I was looking at the right setting for dictionary size. I
> > > just tried it out with 10x, 50x, and even total file size, though,
> > > and still am not seeing a dictionary get created. Is it possible it's
> > > bounded by file page size or some other layout option that I need to
> > > bump as well?
> >
> > Sorry, I'm less familiar with parquet-mr; hopefully someone else can
> > chime in. If I had to guess, maybe somehow the config value isn't
> > making it to the writer (but there could also be something else at
> > play).
> >
> > On Fri, Sep 15, 2023 at 9:33 AM Claire McGinty <claire.d.mcgi...@gmail.com>
> > wrote:
> >
> > > Thanks so much, Micah!
> > >
> > > > I think you are using the right setting, but maybe it is possible
> > > > the strings are still exceeding the threshold (perhaps increasing
> > > > it by 50x or more to verify).
> > >
> > > I'm glad I was looking at the right setting for dictionary size. I
> > > just tried it out with 10x, 50x, and even total file size, though,
> > > and still am not seeing a dictionary get created. Is it possible it's
> > > bounded by file page size or some other layout option that I need to
> > > bump as well?
> > >
> > > > I haven't seen any discussion during my time in the community, but
> > > > maybe it was discussed in the past. I think the main challenge here
> > > > is that pages are either dictionary encoded or not. I'd guess that
> > > > to make this practical there would need to be a new hybrid page
> > > > type, which I think might be an interesting idea but quite a bit of
> > > > work. Additionally, one would likely need heuristics for when to
> > > > potentially use the new mode versus a complete fallback.
> > >
> > > Got it, thanks for the explanation! It does seem like a huge amount
> > > of work.
> > >
> > > Best,
> > > Claire
> > >
> > > On Thu, Sep 14, 2023 at 5:16 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > > - What's the heuristic for Parquet dictionary writing to succeed
> > > > > for a given column?
> > > >
> > > > https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
> > > >
> > > > > - Is that heuristic configurable at all?
> > > >
> > > > I think you are using the right setting, but maybe it is possible
> > > > the strings are still exceeding the threshold (perhaps increasing
> > > > it by 50x or more to verify).
> > > >
> > > > > - For high-cardinality datasets, has the idea of a frequency-based
> > > > > dictionary encoding been explored? Say, if the data follows a
> > > > > certain statistical distribution, we can create a dictionary of
> > > > > the most frequent values only?
> > > >
> > > > I haven't seen any discussion during my time in the community, but
> > > > maybe it was discussed in the past.
> > > > I think the main challenge here is that pages are either dictionary
> > > > encoded or not. I'd guess that to make this practical there would
> > > > need to be a new hybrid page type, which I think might be an
> > > > interesting idea but quite a bit of work. Additionally, one would
> > > > likely need heuristics for when to potentially use the new mode
> > > > versus a complete fallback.
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty
> > > > <claire.d.mcgi...@gmail.com> wrote:
> > > >
> > > > > Hi dev@,
> > > > >
> > > > > I'm running some benchmarking on Parquet read/write performance
> > > > > and have a few questions about how dictionary encoding works
> > > > > under the hood. Let me know if there's a better channel for
> > > > > this :)
> > > > >
> > > > > My test case uses parquet-avro, where I'm writing a single file
> > > > > containing 5 million records. Each record has a single column, an
> > > > > Avro String field (Parquet binary field). I ran two configurations
> > > > > of the base setup: in the first case, the string field has 5,000
> > > > > possible unique values; in the second, it has 50,000 unique
> > > > > values.
> > > > >
> > > > > In the first case (5k unique values), I used parquet-tools to
> > > > > inspect the file metadata and found that a dictionary had been
> > > > > written:
> > > > >
> > > > > % parquet-tools meta testdata-case1.parquet
> > > > > file schema: testdata.TestRecord
> > > > > --------------------------------------------------------------------
> > > > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> > > > >
> > > > > row group 1: RC:5000001 TS:18262874 OFFSET:4
> > > > > --------------------------------------------------------------------
> > > > > stringField: BINARY UNCOMPRESSED DO:4 FPO:38918
> > > > > SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY
> > > > > ST:[min: 0, max: 999, num_nulls: 0]
> > > > >
> > > > > But in the second case (50k unique values), parquet-tools shows
> > > > > that no dictionary gets created, and the file size is *much*
> > > > > bigger:
> > > > >
> > > > > % parquet-tools meta testdata-case2.parquet
> > > > > file schema: testdata.TestRecord
> > > > > --------------------------------------------------------------------
> > > > > stringField: REQUIRED BINARY L:STRING R:0 D:0
> > > > >
> > > > > row group 1: RC:5000001 TS:18262874 OFFSET:4
> > > > > --------------------------------------------------------------------
> > > > > stringField: BINARY UNCOMPRESSED DO:0 FPO:4
> > > > > SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED
> > > > > ST:[min: 0, max: 9999, num_nulls: 0]
> > > > >
> > > > > (I created a gist of my test reproduction here
> > > > > <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
> > > > >
> > > > > Based on this, I'm guessing there's some tip-over point after
> > > > > which Parquet will give up on writing a dictionary for a given
> > > > > column? After reading the Configuration docs
> > > > > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
> > > > > I tried increasing the dictionary page size configuration 5x,
> > > > > with the same result (no dictionary created).
> > > > >
> > > > > So to summarize, my questions are:
> > > > >
> > > > > - What's the heuristic for Parquet dictionary writing to succeed
> > > > > for a given column?
> > > > > - Is that heuristic configurable at all?
> > > > > - For high-cardinality datasets, has the idea of a frequency-based
> > > > > dictionary encoding been explored? Say, if the data follows a
> > > > > certain statistical distribution, we can create a dictionary of
> > > > > the most frequent values only?
> > > > >
> > > > > Thanks for your time!
> > > > > - Claire
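
For context on the two checks referenced above: DictionaryValuesWriter.java#L117 is the dictionary-size fallback and #L124 is the encoding-effectiveness check. The following is a rough paraphrase of the two conditions, simplified for illustration -- the names approximate the parquet-mr source rather than reproduce it:

// Rough paraphrase of the fallback conditions in parquet-mr's
// DictionaryValuesWriter; simplified sketch, not the actual source.
public class DictionaryFallbackSketch {

  // ~L117: abandon dictionary encoding once the accumulated dictionary
  // bytes exceed the configured dictionary page size threshold.
  static boolean shouldFallBack(long dictionaryByteSize, long maxDictionaryByteSize) {
    return dictionaryByteSize > maxDictionaryByteSize;
  }

  // ~L124: even when the dictionary fits, keep it only if the encoded ids
  // plus the dictionary are actually smaller than the raw (PLAIN) values.
  static boolean isCompressionSatisfying(long rawSize, long encodedSize,
      long dictionaryByteSize) {
    return (encodedSize + dictionaryByteSize) < rawSize;
  }
}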