RussellSpitzer commented on PR #8621:
URL: https://github.com/apache/iceberg/pull/8621#issuecomment-1732196831

   We talked about this offline before, and I think it's probably the right 
thing to do.
   
   I only wonder about how this interacts with our inability to handle 
distribution and sort orders in older Spark versions. Currently we do fail out 
when folks attempt to write unsorted data during a CTAS statement on older 
Spark releases. With this change, those writes would succeed but would probably 
create a huge number of files.
   
   For example, before this change, if I ran a statement like
   
   CREATE TABLE ... PARTITIONED BY (bucket(100, x)) ... AS SELECT ...
   
   Previously, on Spark 3.4 this would fail because the data within the write 
tasks would not abide by the partitioning request. Users don't like this, but 
it does prevent them from accidentally creating 100 files per Spark task when 
the distribution implied by the partitioning is ignored. With this change, the 
write would succeed and we would end up writing those 100 files per Spark task.

