jackMintao opened a new issue, #16767:
URL: https://github.com/apache/iceberg/issues/16767
### Feature Request / Improvement
### Problem Description
When write.distribution-mode=hash is set on an unpartitioned Iceberg table,
the system silently downgrades the distribution mode to NONE because there are
no partition columns to hash-distribute by. This happens in
SparkWriteConf.adjustWriteDistributionMode():
    } else if (mode == HASH && table.spec().isUnpartitioned()) {
      return NONE;
    }
This means users cannot control the number of output files for unpartitioned
tables via hash distribution. They are forced into NONE mode, where each Spark
task writes its own file. Depending on upstream parallelism, this produces
either too many small files (fanout) or too few large files.
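The downgrade can be reproduced in isolation. The following is a minimal stand-alone sketch, not Iceberg's actual code: `DistributionModeSketch`, its `DistributionMode` enum, and `adjust(...)` are hypothetical names, and the table is reduced to a list of partition field names. It models only the branch quoted above.

```java
import java.util.List;

// Simplified model of the downgrade described in this issue.
// All names here are illustrative and are NOT part of Iceberg's API.
final class DistributionModeSketch {
  enum DistributionMode { NONE, HASH, RANGE }

  // Mirrors the quoted branch: HASH on a table with no partition fields
  // is replaced by NONE. Every other combination is returned unchanged.
  static DistributionMode adjust(DistributionMode mode, List<String> partitionFields) {
    if (mode == DistributionMode.HASH && partitionFields.isEmpty()) {
      return DistributionMode.NONE;
    }
    return mode;
  }
}
```

Calling `adjust(DistributionMode.HASH, List.of())` yields `NONE` without any warning to the caller, which is the silent behavior this issue is about. With a non-empty partition field list the requested mode is kept.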
### Current Behavior
1. Set write.distribution-mode=hash on an unpartitioned table → distribution
mode is silently downgraded to NONE
2. No mechanism exists to specify which columns to hash-distribute by for
unpartitioned tables
3. If a user manually configures distribution columns that don't exist in
the table schema, the write fails at runtime with an opaque Spark error
(NamedReference to non-existent column)
### Expected Behavior
1. Users can specify custom columns for hash distribution on unpartitioned
tables via a new distribution-columns configuration
2. When configured, HASH mode is preserved (not downgraded) and data is
shuffled by the specified columns
3. A "*" wildcard expands to all schema columns dynamically
4. An ignore-missing fallback option gracefully handles columns that don't
exist in the schema by falling back to sort-order columns or all columns
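One possible reading of the resolution rules above is sketched below. This is an illustration under stated assumptions, not the proposed implementation: `DistributionColumnsSketch` and `resolve(...)` are hypothetical names, column matching is assumed to be exact and case-sensitive, and the fallback is assumed to replace the whole configured list (sort-order columns first, then all schema columns) as soon as any configured column is missing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating how distribution-columns might be resolved.
// None of these names exist in Iceberg; the fallback policy is an assumption.
final class DistributionColumnsSketch {
  static final String WILDCARD = "*";

  static List<String> resolve(
      List<String> configured,
      List<String> schemaColumns,
      List<String> sortOrderColumns,
      boolean ignoreMissing) {
    // "*" expands to whatever columns the schema has at write time.
    if (configured.contains(WILDCARD)) {
      return List.copyOf(schemaColumns);
    }

    // Collect configured columns that the schema does not contain.
    List<String> missing = new ArrayList<>();
    for (String column : configured) {
      if (!schemaColumns.contains(column)) {
        missing.add(column);
      }
    }
    if (missing.isEmpty()) {
      return List.copyOf(configured);
    }

    // Without ignore-missing, fail early with a message that names the columns.
    if (!ignoreMissing) {
      throw new IllegalArgumentException(
          "Distribution columns not found in table schema: " + missing);
    }

    // Assumed fallback order: sort-order columns if any, otherwise all columns.
    return sortOrderColumns.isEmpty()
        ? List.copyOf(schemaColumns)
        : List.copyOf(sortOrderColumns);
  }
}
```

Validating the column list before building the distribution lets the write fail with a message that names the missing columns, rather than surfacing later as an unresolved reference inside Spark. Whether the fallback should drop only the missing columns instead of the whole list is a design question this sketch does not settle.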
### Motivation
This feature was motivated by a production use case where users writing to
unpartitioned streaming tables needed to control output file counts to avoid
overwhelming downstream readers. Without hash distribution on unpartitioned
tables, file counts are dictated by upstream task parallelism and cannot be
tuned independently.
The ignore-missing fallback was added to handle cases where
distribution-columns is configured at the catalog level but some referenced
columns do not exist in every target table. It prevents runtime failures
while keeping a meaningful distribution where possible.
### Query engine
Spark
### Willingness to contribute
- [x] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time