jackMintao opened a new issue, #16767:
URL: https://github.com/apache/iceberg/issues/16767

   ### Feature Request / Improvement
   
   ### Problem Description
   
   When write.distribution-mode=hash is set on an unpartitioned Iceberg table, 
the system silently downgrades the distribution mode to NONE because there are 
no partition columns to hash-distribute by. This happens in 
SparkWriteConf.adjustWriteDistributionMode():
   
       } else if (mode == HASH && table.spec().isUnpartitioned()) {
           return NONE;
       }
   
   This means users cannot control the number of output files for unpartitioned 
tables via hash distribution. They are forced into NONE mode, where each Spark 
task writes its own file, so the file count follows upstream parallelism: too 
many small files when parallelism is high, too few large files when it is low.
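
The downgrade described above can be modeled in isolation. The following is a minimal standalone sketch, not the Iceberg source: `Mode` is a local enum standing in for Iceberg's distribution modes, the `unpartitioned` boolean stands in for `table.spec().isUnpartitioned()`, and only the HASH branch quoted in this issue is reproduced.

```java
public class DistributionModeSketch {

    // Local stand-in for Iceberg's distribution modes (hypothetical, for illustration only).
    enum Mode { NONE, HASH, RANGE }

    // Mirrors only the branch quoted in this issue:
    // HASH on an unpartitioned table is replaced by NONE, everything else passes through.
    static Mode adjust(Mode mode, boolean unpartitioned) {
        if (mode == Mode.HASH && unpartitioned) {
            return Mode.NONE;
        }
        return mode;
    }

    public static void main(String[] args) {
        System.out.println(adjust(Mode.HASH, true));   // prints NONE
        System.out.println(adjust(Mode.HASH, false));  // prints HASH
    }
}
```

The caller gets no signal that the requested mode was replaced, which is why the downgrade is easy to miss in practice.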
   
   ### Current Behavior
   
   1. Set write.distribution-mode=hash on an unpartitioned table → distribution 
mode is silently downgraded to NONE
   2. No mechanism exists to specify which columns to hash-distribute by for 
unpartitioned tables
   3. If a user manually configures distribution columns that do not exist in 
the table schema, the write fails at runtime with an opaque Spark error 
(a NamedReference to a non-existent column)
   
   ### Expected Behavior
   
   1. Users can specify custom columns for hash distribution on unpartitioned 
tables via a new distribution-columns configuration
   2. When configured, HASH mode is preserved (not downgraded) and data is 
shuffled by the specified columns
   3. A "*" wildcard expands to all schema columns dynamically
   4. An ignore-missing fallback option gracefully handles columns that don't 
exist in the schema by falling back to sort-order columns or all columns
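
One way the column resolution in points 3 and 4 could work is sketched below. This is a sketch under assumptions, not a proposed implementation: the class and method names are hypothetical, columns are modeled as plain strings, and the fallback order (sort-order columns first, then all schema columns, triggered as soon as any configured column is missing) is only one possible reading of the description above. Throwing `IllegalArgumentException` when `ignoreMissing` is off is likewise an assumption, chosen to surface the problem before Spark does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DistributionColumnsSketch {

    // Resolves configured distribution columns against the table schema.
    // Hypothetical helper: none of these names are existing Iceberg API.
    static List<String> resolve(
            List<String> configured,
            List<String> schemaColumns,
            List<String> sortOrderColumns,
            boolean ignoreMissing) {
        // "*" expands to whatever columns the schema has at write time.
        if (configured.contains("*")) {
            return new ArrayList<>(schemaColumns);
        }

        // Collect configured columns that are absent from the schema.
        List<String> missing = new ArrayList<>();
        for (String column : configured) {
            if (!schemaColumns.contains(column)) {
                missing.add(column);
            }
        }

        if (missing.isEmpty()) {
            return new ArrayList<>(configured);
        }
        if (!ignoreMissing) {
            // Assumed behavior: fail with a clear message rather than an opaque Spark error.
            throw new IllegalArgumentException(
                    "Distribution columns not found in schema: " + missing);
        }
        // Assumed fallback order: sort-order columns if any, otherwise all schema columns.
        if (!sortOrderColumns.isEmpty()) {
            return new ArrayList<>(sortOrderColumns);
        }
        return new ArrayList<>(schemaColumns);
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "ts", "payload");
        List<String> sortOrder = Arrays.asList("ts");
        List<String> noSortOrder = Collections.emptyList();

        System.out.println(resolve(Arrays.asList("id"), schema, sortOrder, false));        // [id]
        System.out.println(resolve(Arrays.asList("*"), schema, sortOrder, false));         // [id, ts, payload]
        System.out.println(resolve(Arrays.asList("user_id"), schema, sortOrder, true));    // [ts]
        System.out.println(resolve(Arrays.asList("user_id"), schema, noSortOrder, true));  // [id, ts, payload]
    }
}
```

With a resolver of this shape, HASH mode would only need to be preserved when the resolved list is non-empty; the existing downgrade to NONE could remain the default when no distribution columns are configured.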
   
   
   ### Motivation
   
   This feature was motivated by a production use-case where users writing to 
unpartitioned streaming tables needed to control output file counts to avoid 
overwhelming downstream readers. Without hash distribution on unpartitioned 
tables, file counts are dictated by upstream task parallelism and cannot be 
tuned independently.
   
   The ignore-missing fallback was added to handle cases where 
distribution-columns are configured at a catalog level but referenced columns 
may not exist in all target tables, preventing runtime failures while 
maintaining meaningful distribution where possible.
   
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time

