aokolnychyi commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748123544
> Please take a look at #1955. That exposes _pos so that we can use it.

That is a great PR, let's get it in today.

> If the table has a sort order, add a global sort of all the rows produced by the merge node.

We have this option internally and it works well in some cases. There are a few things we need to be careful about, though.

First, Spark will do a skew estimation step and the actual shuffle using two separate jobs. We don't want to recompute the merge join twice. We have a repartition stage after the join if a global sort on write is requested. While it does help a bit, it is not ideal. We have seen cases where the sort on write is by far the most expensive step of MERGE.

Second, even when we do a global sort, the layout within partitions won't be ideal. So people will most likely have to compact again making the global sort during MERGE redundant.

That's why we have to be careful about a global sort by default. I think this ultimately depends on the use case. Shall we make this configurable in table properties? How many query engines will follow it? Should that config be `copy-on-write` specific? I don't have answers to all the questions but it sounds reasonable to explore.

At the same time, if we don't do the global sort, we may end up having too many small files after the operation. We can consider doing a repartition by the partition columns and sorting by the sort key but that will suffer if we have a lot of data for a single partition. It would be great to know the number of files and the size of data we need to rewrite per partition to make a good decision here.

> If the table does not have a sort order, then add a default sort by _file, _pos, partition columns, and the MERGE condition's references.

Sorting updated records by `_file` and `_pos` may be a bit tricky. For example, I have a file with columns (p, c1, c2, c3) in partition 'A' that is sorted by c1 and c2.
If I have a merge command that updates the c2 column (part of my sort key), my new records will probably be out of order if I sort by `_file` and `_pos`. That said, this is a fallback scenario, so it may not be that big a deal.
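
To make the `_file`/`_pos` concern concrete, here is a minimal sketch in plain Python (no Spark or Iceberg involved). The `Row` type, the file name `f1.parquet`, and the updated value `30` are all hypothetical, chosen only for illustration; the `file` and `pos` fields stand in for the `_file` and `_pos` metadata columns.

```python
from typing import NamedTuple


class Row(NamedTuple):
    file: str  # stands in for the _file metadata column (hypothetical value)
    pos: int   # stands in for the _pos metadata column
    p: str
    c1: int
    c2: int
    c3: str


def is_sorted_by_key(rows):
    """True if rows are ordered by the table sort key (c1, c2)."""
    keys = [(r.c1, r.c2) for r in rows]
    return keys == sorted(keys)


# One data file in partition 'A', originally written in (c1, c2) order.
original = [
    Row("f1.parquet", 0, "A", 1, 10, "x"),
    Row("f1.parquet", 1, "A", 1, 20, "y"),
    Row("f1.parquet", 2, "A", 2, 5, "z"),
]

# A merge that updates c2 (part of the sort key) on the first row.
merged = [r._replace(c2=30) if r.pos == 0 else r for r in original]

# Fallback ordering: keep rows where they were in the source file.
by_file_pos = sorted(merged, key=lambda r: (r.file, r.pos))

# Ordering by the sort key itself, for comparison.
by_sort_key = sorted(merged, key=lambda r: (r.c1, r.c2))

print(is_sorted_by_key(by_file_pos))  # ordering by (file, pos)
print(is_sorted_by_key(by_sort_key))  # ordering by (c1, c2)
```

In this toy case the `(file, pos)` ordering leaves the updated row ahead of a row with a smaller c2, so the rewritten file no longer follows its original (c1, c2) order, while sorting by the key restores it. This only illustrates the ordering issue; it says nothing about the relative cost of either sort in Spark.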
