[GitHub] [iceberg] rdblue commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

GitBox Fri, 18 Dec 2020 10:25:04 -0800


rdblue commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748244697



   > Skewness of partitions is an orthogonal problem and not specific to MERGE 
INTO , am i right ?
   
   The full outer join probably requires shuffling data, which means that it 
will be distributed by the MATCH expression. There's no guarantee that the 
match expression is aligned with the table partitioning. If it isn't, then 
writing without a sort would introduce a ton of small files because each task 
would be writing to each output partition.
   
   To avoid the small files problem, we need to repartition. If we repartition 
by just the partition expressions from the table, there is a good chance of 
producing a plan with too few tasks in the write stage because Spark can't 
split tasks for the same key. This is what introduces the skew. To avoid that, 
we can use a global sort to plan tasks that are balanced.
   
   A global sort is a best practice for writing anyway because it clusters data 
for faster reads.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

Reply via email to