> - What's the heuristic for Parquet dictionary writing to succeed for a
> given column?

https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117
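Paraphrased, the logic around that line looks roughly like the sketch below
(simplified field and constant names, not the exact parquet-mr source):

// Sketch of DictionaryValuesWriter's fallback heuristic (paraphrased;
// names simplified for illustration).
public class DictionaryFallbackSketch {
  // Tied to parquet.dictionary.page.size (1 MiB by default).
  long maxDictionaryByteSize = 1024 * 1024;

  long dictionaryByteSize; // total bytes of the distinct values seen so far
  int dictionarySize;      // number of distinct values seen so far

  // Once this returns true, the writer abandons dictionary encoding for the
  // column chunk and falls back to PLAIN.
  boolean shouldFallBack() {
    return dictionaryByteSize > maxDictionaryByteSize
        || dictionarySize > Integer.MAX_VALUE - 1;
  }

  // Even while the dictionary fits, the encoded data plus the dictionary
  // must beat the raw size, or the writer also falls back.
  boolean isCompressionSatisfying(long rawSizeInBytes, long encodedSize) {
    return (encodedSize + dictionaryByteSize) < rawSizeInBytes;
  }
}

So a column of longer strings can blow past the dictionary byte budget well
before it reaches a large number of distinct values.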
> - Is that heuristic configurable at all?

I think you are using the right setting, but it is possible the strings are
still exceeding the threshold (perhaps try increasing it by 50x or more to
verify).
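For reference, one way to raise that threshold is through the writer builder
(a sketch only; the wrapper class and method here are illustrative, but
withDictionaryPageSize is the real builder knob):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictionarySizeConfig {
  // Returns a parquet-avro writer with the dictionary threshold raised to
  // 50x the 1 MiB default, for the verification suggested above.
  static ParquetWriter<GenericRecord> newWriter(Schema schema, Path path)
      throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withDictionaryEncoding(true)
        // corresponds to parquet.dictionary.page.size
        .withDictionaryPageSize(50 * 1024 * 1024)
        .build();
  }
}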
> - For high-cardinality datasets, has the idea of a frequency-based
> dictionary encoding been explored? Say, if the data follows a certain
> statistical distribution, we can create a dictionary of the most frequent
> values only?

I haven't seen this discussed during my time in the community, but maybe it
came up in the past. I think the main challenge here is that pages are
either dictionary encoded or not. I'd guess that to make this practical
there would need to be a new hybrid page type, which I think might be an
interesting idea but quite a bit of work. Additionally, one would likely
need heuristics for when to use the new mode versus a complete fallback.

Cheers,
Micah

On Thu, Sep 14, 2023 at 12:07 PM Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> Hi dev@,
>
> I'm running some benchmarking on Parquet read/write performance and have a
> few questions about how dictionary encoding works under the hood. Let me
> know if there's a better channel for this :)
>
> My test case uses parquet-avro, where I'm writing a single file containing
> 5 million records. Each record has a single column, an Avro String field
> (a Parquet binary field). I ran two configurations of the same base setup:
> in the first case, the string field has 5,000 possible unique values; in
> the second case, it has 50,000 unique values.
>
> In the first case (5k unique values), I used parquet-tools to inspect the
> file metadata and found that a dictionary had been written:
>
> % parquet-tools meta testdata-case1.parquet
> > file schema: testdata.TestRecord
> >
> > --------------------------------------------------------------------------------
> > stringField: REQUIRED BINARY L:STRING R:0 D:0
> >
> > row group 1: RC:5000001 TS:18262874 OFFSET:4
> >
> > --------------------------------------------------------------------------------
> > stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> > VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999,
> > num_nulls: 0]
>
> But in the second case (50k unique values), parquet-tools shows that no
> dictionary gets created, and the file size is *much* bigger:
>
> % parquet-tools meta testdata-case2.parquet
> > file schema: testdata.TestRecord
> >
> > --------------------------------------------------------------------------------
> > stringField: REQUIRED BINARY L:STRING R:0 D:0
> >
> > row group 1: RC:5000001 TS:18262874 OFFSET:4
> >
> > --------------------------------------------------------------------------------
> > stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> > VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
>
> (I created a gist of my test reproduction here
> <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)
>
> Based on this, I'm guessing there's some tip-over point after which
> Parquet will give up on writing a dictionary for a given column? After
> reading the Configuration docs
> <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
> I tried increasing the dictionary page size configuration 5x, with the
> same result (no dictionary created).
>
> So to summarize, my questions are:
>
> - What's the heuristic for Parquet dictionary writing to succeed for a
> given column?
> - Is that heuristic configurable at all?
> - For high-cardinality datasets, has the idea of a frequency-based
> dictionary encoding been explored? Say, if the data follows a certain
> statistical distribution, we can create a dictionary of the most frequent
> values only?
>
> Thanks for your time!
> - Claire
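The setup in the quoted message can be approximated with a writer loop along
these lines (an illustrative sketch, not the linked gist; the class name and
output path are made up):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictionaryTipOverTest {
  public static void main(String[] args) throws Exception {
    // testdata.TestRecord with a single required string column, matching
    // the file schema shown by parquet-tools above.
    Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
        .fields().requiredString("stringField").endRecord();

    int cardinality = 50_000; // case 1: 5_000 (dictionary); case 2: 50_000
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(
                new Path("testdata-case2.parquet"))
            .withSchema(schema)
            .withDictionaryEncoding(true)
            .build()) {
      for (int i = 0; i < 5_000_000; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("stringField", String.valueOf(i % cardinality));
        writer.write(record);
      }
    }
  }
}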