tgravescs commented on issue #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
URL: https://github.com/apache/spark/pull/25299#issuecomment-527909451

So the downside to using a disk is that, at least on YARN, containers may not see the same disks. One container might start and get all the disks; then a disk perhaps gets temporarily full, and another container starts with all the disks minus that full one. The other downside is that you have to go scan all the disks in your list to find whatever file you are using to communicate with, so the question is whether scanning all the disks is slower than getting the list from the driver. I know some people run with 20+ disks, although at my previous job most hosts had 4 to 6. It would definitely be nice to not have to go to the driver though.

We could try a hybrid approach where it looks for a file in the application-level directory by scanning the disks and uses that if it finds it; if it doesn't, it falls back to the shuffle service or falls back to asking the driver. I would hope it's fairly rare to not find it, so just falling back to the shuffle service would make it simpler. I'm definitely in favor of not hitting the driver if we can help it.

Really, on YARN, other than the issue of different containers getting different disks, there is no reason you need to know the local dirs at all, since it could just scan the disks, but that would require a change to the filename-hashing code in the block manager.
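To make the hybrid idea concrete, here is a minimal sketch in Scala of what such a lookup could look like: try to find the block file by scanning a set of candidate local dirs first, and only fall back to a remote fetch (shuffle service or driver) when it isn't found. This is not the actual Spark API; `candidateLocalDirs`, `fetchRemotely`, and `subDirFor` are hypothetical names used only for illustration.

```scala
import java.io.File

object HybridBlockResolver {

  // Try a purely local lookup first; fall back to a remote fetch otherwise.
  // `fetchRemotely` stands in for asking the external shuffle service (or the driver).
  def resolve(
      candidateLocalDirs: Seq[File],
      blockFileName: String,
      fetchRemotely: String => Array[Byte]): Either[Array[Byte], File] = {

    // Look for the file under each candidate local dir, using a hashed
    // sub-directory layout similar in spirit to what the block manager uses.
    val localHit = candidateLocalDirs.iterator
      .map(dir => new File(subDirFor(dir, blockFileName), blockFileName))
      .find(_.isFile)

    localHit match {
      case Some(file) => Right(file)                        // read directly from local disk
      case None       => Left(fetchRemotely(blockFileName)) // fall back to shuffle service / driver
    }
  }

  // Stand-in for the block manager's hashing of a block name into a sub-directory;
  // included only to keep the sketch self-contained, not the real implementation.
  private def subDirFor(localDir: File, blockFileName: String): File = {
    val subDirsPerLocalDir = 64
    val hash = math.abs(blockFileName.hashCode)
    new File(localDir, "%02x".format(hash % subDirsPerLocalDir))
  }
}
```

The relevant trade-off from the comment shows up in the scan: with 4 to 6 disks the local probe is a handful of `File.isFile` checks, while with 20+ disks it is a longer scan whose cost would need to be weighed against a round trip to the driver.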
