tgravescs commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task location
URL: https://github.com/apache/spark/pull/26633#issuecomment-559124754

Nobody has answered my questions above about why this RDD should be treated differently and what the impact would be. You just keep saying this is for adaptive scheduling. As far as I can see, this is purely another instance of https://issues.apache.org/jira/browse/SPARK-18886, and I don't see why we aren't using the same workaround or fixing the real issue.

> As I pointed out in the very beginning #26633 (comment), each RDD should know their own locality preference as well as the importance of such locality. If we ever had to worry about this location being used improperly, we'd have to worry about if any other regular location is returned correctly by the RDD as well.

I don't agree. HadoopRDD, for instance, knows its locality, but how important that locality is depends heavily on the user and the cluster. I don't see how the LocalShuffledRowRDD is any different. You are saying the user never cares about the locality on this; please explain to me why, and how it is different from HadoopRDD. If we were to turn this on for HadoopRDD, though, we would essentially be bypassing the locality settings.

> Even if we do it, people that run jobs that need delay scheduling still need to set the locality wait. For these users, we need this WILDCARD location feature to enable AQE.

Again, why is AQE different? Let's say I really want my HadoopRDD to use locality, but then the ShuffledRDD hits this issue. As a user I can't just turn locality off for my ShuffledRDD, so what makes the LocalShuffledRowRDD any different?

From what has been described here, this is a very particular case: you have more nodes and reducers than maps, and the maps finish very quickly (probably within 3 seconds). These are the same conditions under which other RDDs can hit the same issue.
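
For readers following the thread, the scenario described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark's actual `TaskSetManager` logic: it assumes one task slot per executor, a single "local vs. any" locality distinction, and a wait timer that restarts whenever a local task launches. The function name `simulate` and all of its parameters are hypothetical. The 3-second wait mirrors the default value of `spark.locality.wait`.

```python
def simulate(num_tasks, num_executors, preferred, task_secs, locality_wait_secs):
    """Toy delay-scheduling model (hypothetical, not Spark's scheduler).

    Every task prefers the executors in `preferred`. A non-preferred executor
    may take a task only once `locality_wait_secs` have passed since the last
    local launch. Returns (finish_time_secs, number_of_executors_used).
    """
    busy_until = [0] * num_executors
    used = set()
    pending = num_tasks
    last_local_launch = 0
    now = 0
    # Offer preferred executors first so local launches restart the timer
    # before any non-local executor is considered in the same tick.
    order = sorted(range(num_executors), key=lambda e: e not in preferred)
    while pending > 0:
        for e in order:
            if pending == 0:
                break
            if busy_until[e] > now:
                continue  # executor is still running a task
            is_local = e in preferred
            waited_out = now - last_local_launch >= locality_wait_secs
            if is_local or waited_out:
                busy_until[e] = now + task_secs
                used.add(e)
                pending -= 1
                if is_local:
                    last_local_launch = now  # timer restarts on a local launch
        now += 1
    return max(busy_until), len(used)


# 100 tasks, 10 executors, only 2 hold the preferred data, 1-second tasks.
print(simulate(100, 10, {0, 1}, task_secs=1, locality_wait_secs=3))
print(simulate(100, 10, {0, 1}, task_secs=1, locality_wait_secs=0))
```

In this model, tasks that finish faster than the wait keep restarting the timer, so the remaining executors stay idle; with the wait set to zero, every executor picks up work. Nothing in the model depends on which RDD produced the location preferences, which is the point being argued in the comment.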
