rdblue commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748244697
> Skewness of partitions is an orthogonal problem and not specific to MERGE INTO , am i right ? The full outer join probably requires shuffling data, which means that it will be distributed by the MATCH expression. There's no guarantee that the match expression is aligned with the table partitioning. If it isn't, then writing without a sort would introduce a ton of small files because each task would be writing to each output partition. To avoid the small files problem, we need to repartition. If we repartition by just the partition expressions from the table, there is a good chance of producing a plan with too few tasks in the write stage because Spark can't split tasks for the same key. This is what introduces the skew. To avoid that, we can use a global sort to plan tasks that are balanced. A global sort is a best practice for writing anyway because it clusters data for faster reads. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
