Hi dev@, I'm running some benchmarking on Parquet read/write performance and have a few questions about how dictionary encoding works under the hood. Let me know if there's a better channel for this :)
My test case uses parquet-avro, where I'm writing a single file containing 5 million records. Each record has a single column: an Avro String field (Parquet binary field). I ran two configurations of the same base setup: in the first case, the string field has 5,000 possible unique values; in the second case, it has 50,000 unique values.

In the first case (5k unique values), I used parquet-tools to inspect the file metadata and found that a dictionary had been written:

% parquet-tools meta testdata-case1.parquet

> file schema: testdata.TestRecord
> --------------------------------------------------------------------------------
> stringField: REQUIRED BINARY L:STRING R:0 D:0
>
> row group 1: RC:5000001 TS:18262874 OFFSET:4
> --------------------------------------------------------------------------------
> stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]

But in the second case (50k unique values), parquet-tools shows that no dictionary gets created, and the file size is *much* bigger:

% parquet-tools meta testdata-case2.parquet

> file schema: testdata.TestRecord
> --------------------------------------------------------------------------------
> stringField: REQUIRED BINARY L:STRING R:0 D:0
>
> row group 1: RC:5000001 TS:18262874 OFFSET:4
> --------------------------------------------------------------------------------
> stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]

(I created a gist of my test reproduction here: <https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)

Based on this, I'm guessing there's some tip-over point after which Parquet will give up on writing a dictionary for a given column? After reading the Configuration docs <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>, I tried increasing the dictionary page size configuration 5x, with the same result (no dictionary created).

So to summarize, my questions are:

- What's the heuristic for Parquet dictionary writing to succeed for a given column?
- Is that heuristic configurable at all?
- For high-cardinality datasets, has the idea of a frequency-based dictionary encoding been explored? Say, if the data follows a certain statistical distribution, could a dictionary be built from only the most frequent values?

Thanks for your time!

- Claire
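
P.S. For extra context, here's a simplified sketch of roughly what the write path in my reproduction looks like (the gist above has the actual code, so the class and file names here are just illustrative). The dictionary page size bump I mentioned corresponds to parquet.dictionary.page.size, set here via the writer builder:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    // Illustrative sketch of the reproduction; not the exact gist code.
    public class DictionaryRepro {
      public static void main(String[] args) throws IOException {
        // Single required string column, matching the file schema in the metadata above.
        Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
            .fields().requiredString("stringField").endRecord();

        int cardinality = 50_000; // 5_000 for case 1, 50_000 for case 2

        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case2.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
                .withDictionaryEncoding(true)
                // parquet.dictionary.page.size bumped to 5x the 1 MB default
                .withDictionaryPageSize(5 * 1024 * 1024)
                .build()) {
          for (int i = 0; i < 5_000_000; i++) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("stringField", String.valueOf(i % cardinality));
            writer.write(record);
          }
        }
      }
    }

In this sketch, the cardinality constant is the only thing that changes between the two cases.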