maryannxue commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task location
URL: https://github.com/apache/spark/pull/26633#issuecomment-558850004

> I don't follow this logic how do you go from 200 output partitions to 40 tasks? I would expect 200 output partitions to have 200 tasks. Doesn't matter too much as the main issue is your next sentence.

It's 200 tasks overall, but each mapper has 50. It's that simple, but it doesn't really matter.

> what kind of performance impact do you see if you just don't set preferred locations at all in your RDD?

It would be no different from ShuffledRowRDD, so why would we bother with LocalShuffledRowRDD in the first place?

> But goes back to what I said before, I don't see how this is any different than any other RDD.

It is no different from the other RDDs you mentioned. The only difference is that this RDD has a definitive "baseline" or "goal": it aims to perform no worse than a regular shuffle, and better if possible. For other RDDs, I can't say what the target is or what performance impact is considered acceptable.

Yes, setting the locality wait to 0 would solve the problem of LocalShuffledRowRDD perfectly, and the effect is equivalent to this PR's proposal on a single task set alone. This leaves us with only one difference of opinion: the exceptions, which I choose not to disclose.
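
For reference, the "locality wait to 0" alternative discussed here could be tried with a configuration along the following lines. This is only a minimal sketch, assuming spark-core is on the classpath (for example inside `spark-shell`); the application name is a hypothetical placeholder, and note that the setting applies to every task set in the application, not just to the LocalShuffledRowRDD stage.

```scala
import org.apache.spark.SparkConf

// Sketch only: disables the scheduler's locality wait for the whole application.
// "locality-wait-example" is a placeholder name, not something from this PR.
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  // With a zero wait, the scheduler does not hold a task back hoping for a
  // more local slot; it launches on whatever executor is free.
  .set("spark.locality.wait", "0s")
```

The same property can also be passed at submit time with `--conf spark.locality.wait=0s`.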
