[I] Per-column dictionary encoding control in Parquet.WriteBuilder [iceberg]

via GitHub Thu, 28 May 2026 14:12:16 -0700


Gerrrr opened a new issue, #16599:
URL: https://github.com/apache/iceberg/issues/16599


   ### Feature Request / Improvement
   
   I'd like to be able to disable dictionary encoding for one specific column 
while keeping it on for the rest of the table. As far as I can tell there's no 
way to do this through Iceberg's writer today.
   
   Motivating case: a table with a binary column holding serialized message 
payloads. Each payload is unique, so that column has essentially one distinct 
value per row. The rest of the table is normal analytical data where dictionary 
encoding does what it's supposed to do.
   
   On write, parquet-java starts with dictionary encoding on the blob column. 
Every value is unique, so the dictionary page hits its 2 MiB cap 
(`write.parquet.dict-size-bytes`) within a few thousand rows. parquet-java then 
re-encodes all the buffered rows as PLAIN and uses PLAIN for the rest of the 
row group. The on-disk output therefore looks like what PLAIN + ZSTD would have 
given us from row 1. The cost of getting there is building the dictionary in 
the first place, holding it in memory, and re-encoding everything already 
buffered when the cap is hit. With several blob columns in the same table, or 
with multiple writers running in parallel, the CPU and heap overhead adds up.
   
   parquet-java already supports per-column dictionary control via 
`ParquetProperties.Builder.withDictionaryEncoding(String columnPath, boolean)`. 
Iceberg's `Parquet.WriteBuilder.build()` only calls the global boolean form, 
and `Context.dataContext()` reads a single `write.parquet.dict-enabled` value. 
There's no way to thread the per-column setting through, including via 
`setAll(...)`.
   
   Proposal: add `write.parquet.dict-enabled.column.<columnPath>` as a 
property prefix parsed in `Context.dataContext`, matching the existing 
per-column convention used by `write.parquet.bloom-filter-enabled.column.*` and 
`write.parquet.stats-enabled.column.*` (and what #16090 proposes for 
compression codec). A per-column setting overrides the global one when present. 
Also add a matching `withDictionaryEncoding(String columnPath, boolean)` on 
`Parquet.WriteBuilder`.
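
   A minimal sketch of how the proposed prefix could be parsed and resolved 
against the global flag. This is not Iceberg code: the class 
`DictEncodingOverrides` and its methods are hypothetical, the prefix is the one 
proposed above and does not exist today, and the global default of `true` is an 
assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Iceberg.
final class DictEncodingOverrides {
  // Global property named in this issue.
  static final String DICT_ENABLED = "write.parquet.dict-enabled";
  // Proposed per-column prefix; it does not exist in Iceberg today.
  static final String DICT_ENABLED_COLUMN_PREFIX = DICT_ENABLED + ".column.";

  private DictEncodingOverrides() {}

  // Collects columnPath -> enabled for every key under the proposed prefix.
  public static Map<String, Boolean> parse(Map<String, String> props) {
    Map<String, Boolean> overrides = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : props.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(DICT_ENABLED_COLUMN_PREFIX)
          && key.length() > DICT_ENABLED_COLUMN_PREFIX.length()) {
        String columnPath = key.substring(DICT_ENABLED_COLUMN_PREFIX.length());
        overrides.put(columnPath, Boolean.parseBoolean(entry.getValue()));
      }
    }
    return overrides;
  }

  // The per-column setting wins when present; otherwise the global flag applies.
  // Assumption: dictionary encoding defaults to enabled when the global key is unset.
  public static boolean isEnabled(Map<String, String> props, String columnPath) {
    Boolean override = parse(props).get(columnPath);
    if (override != null) {
      return override;
    }
    return Boolean.parseBoolean(props.getOrDefault(DICT_ENABLED, "true"));
  }
}
```

   With `write.parquet.dict-enabled=true` and 
`write.parquet.dict-enabled.column.payload=false`, `isEnabled` returns `false` 
for `payload` and `true` for every other column. In the writer, each parsed 
entry could then be passed to 
`ParquetProperties.Builder.withDictionaryEncoding(String columnPath, boolean)`.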
   
   ### Query engine
   
   None
   
   ### Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time



