fitermay commented on issue #23986: [SPARK-27070] Fix performance bug in DefaultPartitionCoalescer URL: https://github.com/apache/spark/pull/23986#issuecomment-470621102 @srowen @attilapiros After digging into this further I found out what exactly is happening and why sorting here causes a major issue. It runs out EMRFS returns the string '*' as the host of each block. This ends up invoking the worst case of this algorithm where it tries to jam everything into the same preferred partition. In turn, this ends up running sort on hundreds thousands of records each iteration to find the minimum. I've contacted the EMR team to suggest changing the host to 'localhost' but apparently that would break MR performance on Yarn. I still think this patch is a win because: 1) It's actually simpler and less code than the pre-patch code 2) There are lots of EMR users who would benefit from this until a strategic solution is found 3) It improves performance in less extreme cases as well I will try to make the suggested changes and also generate some performance numbers for the extreme case tonight. Thanks!
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
