SinghAsDev commented on PR #4692: URL: https://github.com/apache/iceberg/pull/4692#issuecomment-1130567256
> The use case we are talking about is copy-on-write DELETE **_executed using a broadcast join_** where we read files from the current spec, only one file per split, files are already reasonably compacted and sorted as needed. Right now, we can avoid the shuffle by setting the distribution mode to `none` but we can't disable a potentially redundant local sort. Is my understanding correct?

That is correct. However, we want to remove sorts automatically, without users having to figure out whether all files are already sorted. Imagine a dataset that goes through deletions daily: the first deletion should do a global sort while rewriting, but subsequent deletes should not have to perform any sort (not even a local sort).

> If we had a way to pass options to DELETE commands, we could simply support the same using these steps:
>
> * Set `read.split.open-file-cost` to `Long.MaxValue` to force one file per split
> * Set `write.delete.distribution-mode` to `none`
> * Set write option `use-table-distribution-and-ordering` to `false`
>
> I will be adding OPTIONS to row-level commands in Spark too but it won't be there until Spark 3.4.

For this to work, users would first have to figure out whether all files are sorted and then act accordingly. If that is something users want to do, they can already do it today by explicitly adding `order by` only in the first delete. However, there would still be a redundant local sort in subsequent deletes.
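For reference, the first two steps of the quoted workaround are Iceberg table properties and could be sketched as follows (the table name `db.tbl` is a placeholder; the third step, the `use-table-distribution-and-ordering` write option, cannot be attached to a DELETE until row-level commands accept OPTIONS, which the comment notes will not land before Spark 3.4):

```sql
-- Force one file per split by making each open file look maximally expensive
ALTER TABLE db.tbl SET TBLPROPERTIES (
  'read.split.open-file-cost' = '9223372036854775807',  -- Long.MaxValue
  'write.delete.distribution-mode' = 'none'             -- avoid the shuffle on delete
);

-- The local sort is still applied automatically; disabling it per-command is
-- what this discussion is about.
DELETE FROM db.tbl WHERE <predicate>;
```

This sketch only reproduces the settings named in the comment; it does not by itself remove the redundant local sort.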
