RussellSpitzer commented on PR #8621: URL: https://github.com/apache/iceberg/pull/8621#issuecomment-1732196831
We talked about this offline before, and I think it's probably the right thing to do. My only concern is the combination of this change with our inability to handle distribution and sort orders in older Spark versions.

Currently we fail out when folks attempt to write unsorted data during a CTAS statement on older Spark releases. This change would cause those writes to succeed, but probably create a huge number of files. For example, previously on Spark 3.4 a statement like

`CREATE TABLE ... PARTITIONED BY (bucket(100, x)) ... AS SELECT ...`

would fail because the data within the write tasks would not abide by the partitioning request. Users don't like this, but it does prevent them from accidentally creating 100 files per Spark task when the distribution implied by the partitioning is ignored. With this change, that write would succeed and we would end up writing those 100 files per Spark task.
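To make the file-count concern concrete, here is a small arithmetic sketch (the task count is a hypothetical number; the 100 buckets come from the example above):

```python
# If the distribution requested by a bucket partition transform is ignored,
# each Spark write task can receive rows for every bucket, so each task
# opens roughly one file per bucket.
num_tasks = 200    # hypothetical number of Spark write tasks
num_buckets = 100  # from bucket(100, x) in the example above

# Worst case with no clustering by bucket before the write:
files_without_distribution = num_tasks * num_buckets

# With a hash distribution on the bucket value, each bucket's rows land
# in a single task, so the write produces roughly one file per bucket:
files_with_distribution = num_buckets

print(files_without_distribution)  # 20000
print(files_with_distribution)     # 100
```

So failing fast, while annoying, protects users from a 200x blowup in file count in this sketch.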
