tgravescs commented on issue #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
URL: https://github.com/apache/spark/pull/25299#issuecomment-527909451

So the downside to using a disk is that, at least on YARN, containers may not see the same disks. One container might start and get all the disks; then a disk perhaps gets temporarily full, and another container starts with all the disks minus that full one. The other downside is that you have to go scan all the disks in your list to find whatever file you are using to communicate with, so the question is whether scanning all the disks is slower than getting the list from the driver. I know some people run with 20+ disks, although at my previous job most hosts had 4 to 6. It would definitely be nice to not have to go to the driver though.

We could try a hybrid approach where it looks for a file in the application-level directory by scanning the disks and uses that if it finds it; if it doesn't, it falls back to the shuffle service or falls back to asking the driver. I would hope it's fairly rare to not find it, so just falling back to the shuffle service would make it simpler. I'm definitely in favor of not hitting the driver if we can help it.

Really, on YARN, other than the issue of different containers getting different disks, there is no reason you need to know the local dirs at all, since it could just scan the disks, but that would require a change to the filename-hashing code in the block manager.
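To make the hybrid idea concrete, here is a minimal sketch in Scala of what such a lookup could look like: try to find the block file by scanning a set of candidate local dirs first, and only fall back to a remote fetch (shuffle service or driver) when it isn't found. This is not the actual Spark API; `candidateLocalDirs`, `fetchRemotely`, and `subDirFor` are hypothetical names used only for illustration.

```scala
import java.io.File

object HybridBlockResolver {

  // Try a purely local lookup first; fall back to a remote fetch otherwise.
  // `fetchRemotely` stands in for asking the external shuffle service (or the driver).
  def resolve(
      candidateLocalDirs: Seq[File],
      blockFileName: String,
      fetchRemotely: String => Array[Byte]): Either[Array[Byte], File] = {

    // Look for the file under each candidate local dir, using a hashed
    // sub-directory layout similar in spirit to what the block manager uses.
    val localHit = candidateLocalDirs.iterator
      .map(dir => new File(subDirFor(dir, blockFileName), blockFileName))
      .find(_.isFile)

    localHit match {
      case Some(file) => Right(file)                        // read directly from local disk
      case None       => Left(fetchRemotely(blockFileName)) // fall back to shuffle service / driver
    }
  }

  // Stand-in for the block manager's hashing of a block name into a sub-directory;
  // included only to keep the sketch self-contained, not the real implementation.
  private def subDirFor(localDir: File, blockFileName: String): File = {
    val subDirsPerLocalDir = 64
    val hash = math.abs(blockFileName.hashCode)
    new File(localDir, "%02x".format(hash % subDirsPerLocalDir))
  }
}
```

The relevant trade-off from the comment shows up in the scan: with 4 to 6 disks the local probe is a handful of `File.isFile` checks, while with 20+ disks it is a longer scan whose cost would need to be weighed against a round trip to the driver.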
