aokolnychyi commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748938791
Okay, it seems like we agree on doing a global sort (after an extra round-robin partitioning to make sure we don't execute the merge join twice) and having a table property under `write.merge.copy-on-write` to enable/disable the global sort in MERGE. If the global sort is disabled, we still need to perform a local sort. We will need to think about a proper name for the property, but we can do that later. Given the performance penalty a global sort can introduce in MERGE statements, I'd recommend not enabling it by default, but I can be convinced otherwise.

> For sorting by _file and _pos, what if we only did that for existing rows? We can discard the columns for updated rows. That way we rewrite the data files as though the rows were deleted and append the inserts and updates together. We may even want to do this in all cases: always prepend _file and _pos to whatever sort order we inject.

I think this is promising if we can easily nullify `_file` and `_pos` for updated rows and if Spark range estimation will do what we hope. Can someone estimate the complexity of implementing this? I'd support this idea.
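To make the intent concrete, here is a minimal sketch in plain Python (not Iceberg or Spark code; the row layout and the `merge_sort_key` helper are hypothetical, purely illustrative) of how prepending `_file` and `_pos` to the injected sort order would behave once those columns are nulled out for updated rows: a single nulls-last sort clusters unchanged rows back into their original file order, while updated and inserted rows group together at the end under the table's own sort order.

```python
# Hypothetical rows after the MERGE join: unchanged rows keep their
# (_file, _pos) metadata; updated/inserted rows have it discarded (None).
rows = [
    {"_file": "f1.parquet", "_pos": 2, "id": 3},     # unchanged
    {"_file": None,         "_pos": None, "id": 9},  # updated (metadata nulled)
    {"_file": "f1.parquet", "_pos": 0, "id": 1},     # unchanged
    {"_file": "f0.parquet", "_pos": 5, "id": 7},     # unchanged
    {"_file": None,         "_pos": None, "id": 4},  # inserted
]

def merge_sort_key(row):
    # Nulls-last ordering: rows without file/pos metadata sort after all
    # existing rows, then fall back to the injected sort order (here `id`).
    return (
        row["_file"] is None,
        row["_file"] or "",
        row["_pos"] if row["_pos"] is not None else -1,
        row["id"],
    )

rows.sort(key=merge_sort_key)
# Unchanged rows come back in (file, pos) order; new/updated rows follow.
```

In a real implementation, a Spark range partitioner estimating boundaries over this composite key is what would (hopefully) keep unchanged rows co-located with their original files, which is the open question about range estimation above.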
