rdblue commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748246932
@aokolnychyi, I agree with the idea to have a flag to disable global sort. It is probably best to make this specific to copy-on-write, because delta writes will need to be sorted by `_file` and `_pos` for deletes, and we expect the inserts to be much, much smaller than the copy-on-write data. If we aren't rewriting retained rows, I think the global sort (with a repartition as you said) would be much cheaper.

For sorting by `_file` and `_pos`, what if we only did that for existing rows? We can discard the columns for updated rows. That way we rewrite the data files as though the rows were deleted and append the inserts and updates together. We may even want to do this in all cases: always prepend `_file` and `_pos` to whatever sort order we inject.
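
To illustrate the ordering idea, here is a minimal Python sketch, not Iceberg or Spark code. It assumes retained rows carry their `_file` and `_pos` values while updated and inserted rows have had those columns discarded (modeled as `None`). The `Row` type, the `key` column standing in for the table's sort order, and placing the new rows after the retained ones are all assumptions made for this example.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    file: Optional[str]  # stands in for _file; None for updated/inserted rows
    pos: Optional[int]   # stands in for _pos; None for updated/inserted rows
    key: int             # hypothetical column used by the table's sort order
    value: str


def write_order(row: Row):
    # Prepend (_file, _pos) to the table sort order.
    # Retained rows sort first, by original file and position.
    # New rows (metadata discarded) sort after them, by the table sort order.
    is_new = row.file is None
    file_part = row.file if row.file is not None else ""
    pos_part = row.pos if row.pos is not None else -1
    return (is_new, file_part, pos_part, row.key)


def order_for_write(rows):
    return sorted(rows, key=write_order)


rows = [
    Row(None, None, 7, "updated"),
    Row("b.parquet", 0, 9, "kept"),
    Row("a.parquet", 2, 5, "kept"),
    Row(None, None, 3, "inserted"),
    Row("a.parquet", 0, 8, "kept"),
]
ordered = order_for_write(rows)
```

In this sketch the retained rows come out in their original file and position order, as though the changed rows had simply been deleted, and the updates and inserts follow together, sorted by `key`.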
