Gerrrr opened a new issue, #16599:
URL: https://github.com/apache/iceberg/issues/16599
### Feature Request / Improvement
I'd like to be able to disable dictionary encoding for one specific column
while keeping it on for the rest of the table. As far as I can tell there's no
way to do this through Iceberg's writer today.
Motivating case: a table with a binary column holding serialized message
payloads. Each payload is unique, so cardinality on that column is essentially
one-per-row. The rest of the table is normal analytical data where dictionary
encoding does what it's supposed to do.
On write, parquet-java starts with dictionary encoding on the blob column.
Every value is unique, so the dictionary page hits its 2 MiB cap
(`write.parquet.dict-size-bytes`) within a few thousand rows. parquet-java then
re-encodes all the buffered rows as PLAIN and uses PLAIN for the rest of the
row group. So the on-disk output looks like what PLAIN + ZSTD would have given
us from row 1. The cost of getting there is building the dictionary in the
first place, holding it in memory, and re-encoding everything already buffered
when the cap hits. With several blob columns in the same table, or with
multiple writers running in parallel, the CPU and heap overhead adds up.
parquet-java already supports per-column dictionary control via
`ParquetProperties.Builder.withDictionaryEncoding(String columnPath, boolean)`.
Iceberg's `Parquet.WriteBuilder.build()` only calls the global boolean form,
and `Context.dataContext()` reads a single `write.parquet.dict-enabled` value.
There's no way to thread the per-column setting through, including via
`setAll(...)`.
Proposal: add `write.parquet.dict-enabled.column.<columnPath>` as a
property prefix parsed in `Context.dataContext`, matching the existing
per-column convention used by `write.parquet.bloom-filter-enabled.column.*` and
`write.parquet.stats-enabled.column.*` (and what #16090 proposes for
compression codec). Per-column setting overrides the global one when present.
Also add a matching `withDictionaryEncoding(String columnPath, boolean)` on
`Parquet.WriteBuilder`.
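
To make the proposed resolution order concrete, here is a minimal sketch of how the prefix could be parsed and how a per-column value could override the global one. It uses only `java.util` and is not Iceberg code: the class `DictEnabledConfig`, its method names, and the assumed default of `true` when the global key is absent are all hypothetical, chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DictEnabledConfig {
  // Global key named in this issue; the per-column prefix is the proposed form.
  static final String DICT_ENABLED = "write.parquet.dict-enabled";
  static final String COLUMN_PREFIX = DICT_ENABLED + ".column.";
  // Assumption for this sketch: dictionary encoding is on when nothing is set.
  static final boolean DEFAULT_DICT_ENABLED = true;

  /** Collects per-column overrides as column path -> dictionary on/off. */
  static Map<String, Boolean> columnOverrides(Map<String, String> props) {
    Map<String, Boolean> overrides = new HashMap<>();
    for (Map.Entry<String, String> entry : props.entrySet()) {
      String key = entry.getKey();
      // Skip the global key itself and any key with an empty column path.
      if (key.startsWith(COLUMN_PREFIX) && key.length() > COLUMN_PREFIX.length()) {
        String columnPath = key.substring(COLUMN_PREFIX.length());
        overrides.put(columnPath, Boolean.parseBoolean(entry.getValue()));
      }
    }
    return overrides;
  }

  /** Per-column setting wins when present; otherwise fall back to the global value. */
  static boolean dictEnabledFor(Map<String, String> props, String columnPath) {
    Boolean override = columnOverrides(props).get(columnPath);
    if (override != null) {
      return override;
    }
    String global = props.get(DICT_ENABLED);
    return global == null ? DEFAULT_DICT_ENABLED : Boolean.parseBoolean(global);
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(DICT_ENABLED, "true");
    props.put(COLUMN_PREFIX + "payload", "false");

    System.out.println(dictEnabledFor(props, "payload")); // prints "false"
    System.out.println(dictEnabledFor(props, "id"));      // prints "true"
  }
}
```

In a real change, each entry from `columnOverrides` would presumably be forwarded to parquet-java's `withDictionaryEncoding(String columnPath, boolean)` after the global boolean form has been applied.
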
### Query engine
None
### Willingness to contribute
- [x] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]