tgravescs commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task location
URL: https://github.com/apache/spark/pull/26633#issuecomment-558898559

I agree the default setting change needs to happen in a bigger conversation, but if that conversation is going to happen, we shouldn't check this in until it has taken place, in my opinion.

I have not seen a real argument for why this RDD is different from any other. If we fix the real issue with locality, it helps everything. The argument that it's a special version of ShuffledRowRDD and that you sometimes hit this locality issue doesn't convince me. I can hit the locality issue with ShuffledRowRDD, and I might not hit it with the LocalShuffleRowRDD. Why not change ShuffledRowRDD or HadoopRDD to use this as well, since I can hit the same issue there?

The only argument I can see is limited scope, but does it then turn on only in the case described, with mappers < reducers and more executors than mappers? If it turns on more broadly than that, one could argue you aren't following the semantics Spark defines for locality wait.

I don't see any concrete numbers here on the performance impact, on how much this affects users, or on why we should special-case this. If it has a huge impact then I can see why we would special-case it, but I haven't seen any evidence of that. Do we have any cases where this is seen in production? Is there a negative impact if the user just sets node locality wait = 0?

Again, the main issue I have is that once it's introduced, anyone can use it in an RDD, so I consider it a public interface. You say it has limited impact and is only used by adaptive execution, but once it is introduced there is nothing stopping others from using it.

Adding more people to get opinions. @vanzin @dongjoon-hyun @srowen
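
For reference, the "node locality wait = 0" alternative asked about above would look roughly like the following. This is only a minimal sketch, assuming the setting is applied per application through SparkConf; the application name is made up for illustration, and whether this is an acceptable workaround is exactly the open question.

```scala
import org.apache.spark.SparkConf

// Sketch only: skip the node-level locality wait for a single application.
// spark.locality.wait.node is an existing Spark setting that falls back to
// spark.locality.wait when unset. The app name below is hypothetical.
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  .set("spark.locality.wait.node", "0")
```

The same key could also be passed on the command line with --conf spark.locality.wait.node=0, which avoids any code change.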
