Hi dev@,

I'm running some benchmarks on Parquet read/write performance and have a
few questions about how dictionary encoding works under the hood. Let me
know if there's a better channel for this :)

My test case uses parquet-avro to write a single file containing 5 million
records. Each record has a single column: an Avro String field (a Parquet
binary field). I ran two configurations of the same base setup: in the
first case, the string field has 5,000 possible unique values; in the
second case, it has 50,000.
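For reference, the writer setup looks roughly like this (a minimal sketch;
the class/field names and use of stringified random ints are illustrative
here, the exact builder calls in my gist may differ slightly):

import java.util.Random;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class DictionaryTest {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("TestRecord").namespace("testdata")
        .fields().requiredString("stringField").endRecord();

    int cardinality = 5_000; // 50_000 for case 2
    Random random = new Random();
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case1.parquet"))
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
            .build()) {
      for (long i = 0; i < 5_000_001L; i++) {
        GenericRecord record = new GenericData.Record(schema);
        // stringified random ints: "0".."4999" in case 1, "0".."49999" in case 2
        record.put("stringField", String.valueOf(random.nextInt(cardinality)));
        writer.write(record);
      }
    }
  }
}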

In the first case (5k unique values), I used parquet-tools to inspect the
file metadata and found that a dictionary had been written:

% parquet-tools meta testdata-case1.parquet
> file schema:  testdata.TestRecord
>
> --------------------------------------------------------------------------------
> stringField:  REQUIRED BINARY L:STRING R:0 D:0
> row group 1:  RC:5000001 TS:18262874 OFFSET:4
>
> --------------------------------------------------------------------------------
> stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00
> VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls:
> 0]


But in the second case (50k unique values), parquet-tools shows that no
dictionary gets created, and the file size is *much* bigger:

% parquet-tools meta testdata-case2.parquet
> file schema:  testdata.TestRecord
>
> --------------------------------------------------------------------------------
> stringField:  REQUIRED BINARY L:STRING R:0 D:0
> row group 1:  RC:5000001 TS:18262874 OFFSET:4
>
> --------------------------------------------------------------------------------
> stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00
> VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]


(I created a gist of my test reproduction here
<https://gist.github.com/clairemcginty/c3c0be85f51bc23db45a75e8d8a18806>.)

Based on this, I'm guessing there's some tipping point past which Parquet
gives up on writing a dictionary for a given column? After reading the
configuration docs
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md>,
I tried increasing the dictionary page size configuration 5x, with the same
result (no dictionary created).
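Concretely, what I changed (a sketch, assuming the standard ParquetWriter
builder knobs; withDictionaryPageSize is the one I bumped, which I believe
is the builder equivalent of the parquet.dictionary.page.size setting):

// Same writer as above, with dictionary encoding explicitly enabled and
// the dictionary page size raised to 5x the 1 MB default:
ParquetWriter<GenericRecord> writer =
    AvroParquetWriter.<GenericRecord>builder(new Path("testdata-case2.parquet"))
        .withSchema(schema)
        .withDictionaryEncoding(true)             // on by default anyway
        .withDictionaryPageSize(5 * 1024 * 1024)  // default is 1 MB
        .build();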

So to summarize, my questions are:

- What heuristic does Parquet use to decide whether dictionary writing
succeeds for a given column?
- Is that heuristic configurable at all?
- For high-cardinality datasets, has the idea of a frequency-based
dictionary encoding been explored? Say, if the data follows a certain
statistical distribution, we could create a dictionary of the most frequent
values only? (Rough sketch of what I mean after this list.)
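To make that last question concrete, here's the kind of thing I have in
mind -- not any existing Parquet API, just a toy sketch of building a
dictionary from only the top-k most frequent values, with everything else
falling back to plain encoding:

import java.util.*;
import java.util.stream.*;

class TopKDictionary {
  // Map the k most frequent values to dictionary ids; anything not in the
  // returned map would be written with PLAIN encoding instead.
  static Map<String, Integer> build(List<String> values, int k) {
    Map<String, Long> counts = values.stream()
        .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
    List<String> topK = counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
    Map<String, Integer> dict = new HashMap<>();
    for (int i = 0; i < topK.size(); i++) {
      dict.put(topK.get(i), i);
    }
    return dict;
  }
}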

Thanks for your time!
- Claire
