tgravescs commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task 
location
URL: https://github.com/apache/spark/pull/26633#issuecomment-558827310
 
 
   Thanks for the explanation.
   
   >  There can be exceptions where locality does matter a lot and it would be 
worth some wait time.
   
   What are these cases? I'm sure there are some, but based on my experience and that of many others I've talked to, they are the exception rather than the rule. It doesn't matter whether it's HDFS data or shuffle data. People are setting this to zero now anyway, so changing the default makes sense to me.
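   For context, the "setting this to zero" above is just a config change. A sketch of what users are already doing today (`spark.locality.wait` is the real setting; `0s` disables the delay-scheduling wait, and the per-level `spark.locality.wait.node` / `spark.locality.wait.rack` variants can override it per locality level):
   
   ```
   # spark-defaults.conf -- disable the delay-scheduling wait entirely
   spark.locality.wait  0s
   ```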
   
   >  we'd like to have 200 output partitions too, which means there will be 
200 / 5 = 40 tasks for fetching data from each mapper.
   
   I don't follow this logic: how do you go from 200 output partitions to 40 tasks? I would expect 200 output partitions to have 200 tasks. It doesn't matter too much, as the main issue is your next sentence.
   
   >   all these 200 tasks will be scheduled on 5 hosts only, while the other 5 
hosts will not share the workload
   
   Now this part I understand. But it goes back to what I said before: I don't see how this is any different than any other RDD. About a month ago, we specifically had a job that was hitting this same thing during shuffle. We had 20 nodes, but only 1 node was being scheduled on, and it had a significant impact on the job time; we set the locality delay to 0 and worked around the issue. Another example, running on YARN: say I have 10 tasks reading HDFS data, I get 10 nodes, and 5 of those nodes actually have HDFS blocks on them. With locality turned on, those 5 nodes will be loaded up and, depending on how long they take, could keep the other 5 tasks from running as quickly as they should.
   
   So what happens if these 5 mappers have very large or skewed data? Is it better to skip locality? That is going to add to the network usage - who is to say I want that vs. waiting? It might be that the tasks take long enough that scheduling actually does fall back to rack locality - it depends on the harmonics of when tasks finish and are scheduled. I'm willing to bet that in general a locality delay of 0 is still more performant, which is why we go back to a locality delay of 0 as the default. If users actually need locality, they can turn it back on. Now that will also affect your RDD here as well - but this seems like a very specific case.
