maryannxue commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task 
location
URL: https://github.com/apache/spark/pull/26633#issuecomment-558850004
 
 
   > I don't follow this logic. How do you go from 200 output partitions to 40 
tasks? I would expect 200 output partitions to have 200 tasks. It doesn't matter 
too much, as the main issue is your next sentence.
   
   It's 200 tasks overall, but each mapper has 50. It's that simple, but it 
doesn't really matter.
   
   > What kind of performance impact do you see if you just don't set preferred 
locations at all in your RDD?
   
   It would be no different from ShuffledRowRDD, so why would we bother with 
LocalShuffledRowRDD in the first place?
   
   > But it goes back to what I said before: I don't see how this is any 
different than any other RDD.
   
   It is no different from the other RDDs you mentioned. The only difference 
is that this RDD has a definite "baseline" or "goal": it should perform no 
worse than a regular shuffle, and better if possible. For other RDDs, I can't 
say what the target is or what performance impact is considered acceptable.
   
   Yes, setting the locality wait to 0 would solve the problem for 
LocalShuffledRowRDD perfectly, and the effect is equivalent to this PR's 
proposal on a single task set alone. This leaves us with only one difference 
of opinion: the exceptions, which I choose not to disclose.
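   
   For reference, a minimal sketch of what "setting the locality wait to 0" 
could look like, assuming it is done through the standard 
`spark.locality.wait` setting on a `SparkConf`. The application name is 
hypothetical, and this form applies to the whole application rather than to 
one task set:
   
   ```scala
   import org.apache.spark.SparkConf

   // Sketch only: turns off the locality wait for every task set in the
   // application, not just the one that reads LocalShuffledRowRDD.
   val conf = new SparkConf()
     .setAppName("locality-wait-example") // hypothetical name
     .set("spark.locality.wait", "0s")
   ```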

