[GitHub] [iceberg] aokolnychyi commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

GitBox Fri, 18 Dec 2020 06:48:17 -0800


aokolnychyi commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748123544



   > Please take a look at #1955. That exposes _pos so that we can use it.
   
   That is a great PR, let's get it in today.
   
   > If the table has a sort order, add a global sort of all the rows produced 
by the merge node.
   
   We have this option internally and it works well in some cases. There are a 
few things we need to be careful about, though.
   
   First, Spark will do a skew estimation step and the actual shuffle using two 
separate jobs. We don't want to compute the merge join twice, so we add a 
repartition stage after the join if a global sort on write is requested. While 
it does help a bit, it is not ideal. We have seen cases where the sort on write 
is by far the most expensive step of MERGE.
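   
   To illustrate the two-job concern, here is a minimal pure-Python simulation 
(not Spark code). It assumes a range-partitioned sort scans its input once to 
sample boundaries and once more to route rows. `CountingSource`, 
`range_partition_sort` and `count_join_evaluations` are hypothetical names used 
only for this sketch, and materializing the rows first stands in for the 
repartition stage after the join.

```python
import random
from bisect import bisect_right


class CountingSource:
    """Stand-in for an expensive plan (e.g. the merge join); counts scans."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.evaluations = 0

    def __call__(self):
        self.evaluations += 1
        return iter(self.rows)


def range_partition_sort(compute_rows, key, num_partitions, sample_size=100, seed=0):
    """Globally sort rows by range partitioning; scans the input twice."""
    # Pass 1 (sampling job): scan the input to estimate range boundaries.
    rows = list(compute_rows())
    rng = random.Random(seed)
    sample = sorted(key(r) for r in rng.sample(rows, min(sample_size, len(rows))))
    step = len(sample) / num_partitions
    bounds = [sample[int(i * step)] for i in range(1, num_partitions)] if sample else []

    # Pass 2 (shuffle job): scan the input again and route rows by boundary.
    partitions = [[] for _ in range(len(bounds) + 1)]
    for row in compute_rows():
        partitions[bisect_right(bounds, key(row))].append(row)

    # Sorting each range locally yields a global order across partitions.
    return [sorted(p, key=key) for p in partitions]


def count_join_evaluations(rows, materialize_first):
    """Return (number of join evaluations, sorted partitions)."""
    join = CountingSource(rows)
    if materialize_first:
        # Assumption: mimics a repartition stage whose output is reused.
        shuffled = list(join())
        source = lambda: iter(shuffled)
    else:
        source = join
    parts = range_partition_sort(source, key=lambda r: r[0], num_partitions=4)
    return join.evaluations, parts
```

   In this sketch the join is evaluated twice when the sort reads it directly 
and once when its output is materialized first, at the cost of an extra pass 
over the materialized rows.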
   
   Second, even when we do a global sort, the layout within partitions won't be 
ideal. So people will most likely have to compact again, which makes the global 
sort during MERGE redundant.
   
   That's why we have to be careful about doing a global sort by default. I 
think this ultimately depends on the use case. Shall we make this configurable 
in table properties? How many query engines will follow it? Should that config 
be `copy-on-write` specific? I don't have answers to all the questions, but it 
sounds reasonable to explore.
   
   At the same time, if we don't do the global sort, we may end up with too 
many small files after the operation. We can consider repartitioning by the 
partition columns and sorting by the sort key, but that will suffer if we have 
a lot of data for a single partition. It would be great to know the number of 
files and the size of the data we need to rewrite per partition to make a good 
decision here.
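   
   One possible shape for such a decision, as a rough sketch only: summarize 
the files to rewrite per partition, then fall back to a global sort when any 
single partition is too large for one task. Everything here is hypothetical 
(the names `summarize_rewrite` and `choose_write_distribution`, the 
`max_bytes_per_task` threshold, and the returned labels); none of it is an 
existing Iceberg or Spark API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RewriteStats:
    file_count: int = 0
    total_bytes: int = 0


def summarize_rewrite(files):
    """Aggregate (partition, size_in_bytes) pairs into per-partition stats."""
    stats = defaultdict(RewriteStats)
    for partition, size in files:
        stats[partition].file_count += 1
        stats[partition].total_bytes += size
    return dict(stats)


def choose_write_distribution(stats, max_bytes_per_task):
    """Pick a distribution for the rewritten rows (hypothetical heuristic)."""
    # A partition too large for one task would make a hash repartition by
    # the partition columns suffer, so prefer a global sort in that case.
    if any(s.total_bytes > max_bytes_per_task for s in stats.values()):
        return "global-sort"
    # Otherwise cluster by partition and sort locally by the sort key.
    return "partition-hash"
```

   For example, with files `[("A", 100), ("A", 200), ("B", 50)]` and a 
250-byte budget, partition `A` exceeds the budget and the sketch picks the 
global sort. A real heuristic would likely weigh file counts as well.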
   
   > If the table does not have a sort order, then add a default sort by _file, 
_pos, partition columns, and the MERGE condition's references.
   
   Sorting updated records by `_file` and `_pos` may be a bit tricky. For 
example, I have a file with columns (p, c1, c2, c3) in partition 'A' that is 
sorted by c1 and c2. If I have a merge command that updates the c2 column (part 
of my sort key), my new records will probably be out of order if I sort by 
`_file` and `_pos`. That said, this is a fallback scenario, so it may not be 
that big a deal.
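   
   A small pure-Python illustration of that scenario, with made-up rows and a 
made-up file name: the file is sorted by (c1, c2), an update changes c2 for one 
row, and sorting the result by (`_file`, `_pos`) keeps the old positions, so 
the (c1, c2) order is lost. `apply_updates` and `is_sorted` are helper names 
introduced only for this sketch.

```python
def apply_updates(rows, new_c2_by_pos):
    """Return copies of rows with c2 replaced for the given positions."""
    return [
        {**row, "c2": new_c2_by_pos.get(row["_pos"], row["c2"])}
        for row in rows
    ]


def is_sorted(rows, key):
    """True if rows are in non-decreasing order of key."""
    keys = [key(r) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def sort_key(row):
    # The table's sort order in the example: c1, then c2.
    return (row["c1"], row["c2"])


# One data file in partition 'A', originally sorted by (c1, c2).
original = [
    {"_file": "A/f1", "_pos": 0, "p": "A", "c1": 1, "c2": 10, "c3": "x"},
    {"_file": "A/f1", "_pos": 1, "p": "A", "c1": 1, "c2": 20, "c3": "y"},
    {"_file": "A/f1", "_pos": 2, "p": "A", "c1": 1, "c2": 30, "c3": "z"},
    {"_file": "A/f1", "_pos": 3, "p": "A", "c1": 2, "c2": 5, "c3": "w"},
]

# A merge that sets c2 = 99 for the row at position 1.
merged = apply_updates(original, {1: 99})

# Fallback order: by the original file and position.
by_file_pos = sorted(merged, key=lambda r: (r["_file"], r["_pos"]))

# Order that respects the sort key after the update.
by_sort_key = sorted(merged, key=sort_key)
```

   Here `by_file_pos` yields c2 values 10, 99, 30, 5, which is no longer 
sorted by (c1, c2), while `by_sort_key` yields 10, 30, 99, 5.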
   
   


