SinghAsDev commented on PR #4692: URL: https://github.com/apache/iceberg/pull/4692#issuecomment-1130567256
> The use case we are talking about is copy-on-write DELETE **_executed using a broadcast join_** where we read files from the current spec, only one file per split, files are already reasonably compacted and sorted as needed. Right now, we can avoid the shuffle by setting the distribution mode to `none` but we can't disable a potentially redundant local sort. Is my understanding correct?

That is correct. However, we want to remove sorts automatically, without users having to figure out whether all files are already sorted. Imagine a dataset that goes through deletions daily: the first deletion should do a global sort while rewriting, but subsequent deletes should not have to perform any sort (not even a local sort).

> If we had a way to pass options to DELETE commands, we could simply support the same using these steps:
>
> * Set `read.split.open-file-cost` to `Long.MaxValue` to force one file per split
> * Set `write.delete.distribution-mode` to `none`
> * Set write option `use-table-distribution-and-ordering` to `false`
>
> I will be adding OPTIONS to row-level commands in Spark too but it won't be there until Spark 3.4.

For this to work, users would first have to figure out whether all files are sorted and then act accordingly. If that is something users want to do, they can already do it today by explicitly adding `order by` only in the first delete. However, there would still be a redundant local sort in subsequent deletes.
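For reference, the first two steps of the quoted workaround are Iceberg table properties and could be sketched as follows (the table name `db.tbl` is a placeholder; the third step, the `use-table-distribution-and-ordering` write option, cannot be attached to a DELETE until row-level commands accept OPTIONS, which the comment notes will not land before Spark 3.4):

```sql
-- Force one file per split by making each open file look maximally expensive
ALTER TABLE db.tbl SET TBLPROPERTIES (
  'read.split.open-file-cost' = '9223372036854775807',  -- Long.MaxValue
  'write.delete.distribution-mode' = 'none'             -- avoid the shuffle on delete
);

-- The local sort is still applied automatically; disabling it per-command is
-- what this discussion is about.
DELETE FROM db.tbl WHERE <predicate>;
```

This sketch only reproduces the settings named in the comment; it does not by itself remove the redundant local sort.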
