aokolnychyi commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748938791


   Okay, it seems like we agree on doing a global sort (after an extra 
round-robin partitioning to make sure we don't execute the merge join twice) 
and on having a table property under `write.merge.copy-on-write` to 
enable/disable the global sort in MERGE. If the global sort is disabled, we 
still need to perform a local sort. We will need to think of a proper name for 
the property, but we can do that later. Given the performance penalty a global 
sort can introduce in MERGE statements, I'd recommend not enabling it by 
default, but I can be convinced otherwise.
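   To make the proposal concrete, here is a minimal Python sketch of the 
write-path decision (plain lists stand in for Spark partitions, contiguous 
chunks approximate a range exchange, and the property name is a hypothetical 
placeholder since the real name is still to be decided):

```python
# Hypothetical property name -- the actual name is still to be decided.
GLOBAL_SORT_PROPERTY = "write.merge.copy-on-write.global-sort.enabled"

def round_robin(rows, num_partitions):
    """Distribute rows round-robin, like the extra exchange that keeps
    the merge join from being executed twice."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

def range_chunks(ordered, num_partitions):
    """Split a globally ordered sequence into contiguous chunks,
    approximating Spark's range partitioning."""
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def write_merge_output(rows, table_properties, num_partitions=4):
    """Globally sort the MERGE output if the property is enabled,
    otherwise fall back to a local (per-partition) sort."""
    global_sort = table_properties.get(GLOBAL_SORT_PROPERTY, "false") == "true"
    partitions = round_robin(rows, num_partitions)
    if global_sort:
        # One total order across all output tasks.
        ordered = sorted(r for p in partitions for r in p)
        return range_chunks(ordered, num_partitions)
    # Local sort: each task's output is ordered, but there is no
    # total order across tasks.
    return [sorted(p) for p in partitions]
```

With the property enabled, concatenating the output partitions yields one 
fully sorted sequence; with it disabled, each partition is sorted but the 
overall output is not, which is exactly the clustering trade-off discussed 
above.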
   
   > For sorting by _file and _pos, what if we only did that for existing rows? 
We can discard the columns for updated rows. That way we rewrite the data files 
as though the rows were deleted and append the inserts and updates together. We 
may even want to do this in all cases: always prepend _file and _pos to 
whatever sort order we inject.
   
   I think this is promising if we can easily nullify `_file` and `_pos` for 
updated rows and if Spark's range estimation does what we hope. Can someone 
estimate the complexity of implementing this? I'd support this idea.
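   One way to read the quoted suggestion as a sketch: prepend `_file` and 
`_pos` to the injected sort order, but null them out for updated rows so only 
carried-over rows cluster by their original file and position. The Python 
below is illustrative only (the `Row` shape and `is_updated` flag are made-up 
stand-ins, and a tuple flag emulates Spark's nulls-last ordering):

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    file: Optional[str]   # stand-in for the _file metadata column
    pos: Optional[int]    # stand-in for the _pos metadata column
    is_updated: bool      # hypothetical flag: row rewritten by the MERGE
    data: str

def nullify_metadata(row: Row) -> Row:
    """Discard _file/_pos for updated rows so they sort with new inserts."""
    if row.is_updated:
        return row._replace(file=None, pos=None)
    return row

def sort_key(row: Row):
    """Prepend (_file, _pos) to the sort order, nulls last: existing rows
    keep their original file/position clustering, while updated and
    inserted rows are appended together afterwards."""
    if row.file is None:
        return (1, "", 0)  # nulls-last bucket
    return (0, row.file, row.pos)

rows = [
    Row("a.parquet", 1, False, "keep-1"),
    Row("a.parquet", 0, True,  "updated"),
    Row("b.parquet", 0, False, "keep-2"),
    Row(None,        None, False, "insert"),
]
result = sorted(map(nullify_metadata, rows), key=sort_key)
# → keep-1, keep-2, updated, insert
```

The effect is that untouched rows are rewritten in their original file/position 
order, while updates and inserts land together at the end, which is the 
"rewrite as though the rows were deleted, then append" behavior described above.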


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
