tgravescs commented on issue #26633: [SPARK-29994][CORE] Add WILDCARD task location URL: https://github.com/apache/spark/pull/26633#issuecomment-558767028

Fixing the scheduler locality algorithm is definitely more changes. The locality delay, to me, should be a per-task delay: if a task doesn't get scheduled in 3 seconds, then fall to the next locality level. Right now it waits for any task to not be scheduled for 3 seconds at that locality. I know Kay has an argument for the FairScheduler use case, but I don't know that I agree with it, or that it isn't handled by the per-task delay. If you really want your tasks to wait that long for locality, you can simply set it higher. I'm not sure what code changes are required to make that change, though, or how ugly the code gets if we really wanted to leave the old behavior in there behind a config.

> But the problem is when we have less mappers (from the shuffle map stage) than the number of worker nodes, e.g., 5 vs. 10, and if we stick to the preferred locations, the LocalShuffledRowRDD will suffer from locality wait and be even slower than the original ShuffledRowRDD.

I'm not sure I follow this statement. If you have fewer mappers, let's say you have 5 and you have 10 worker nodes (assuming this is standalone mode, or do you mean executors?), the 5 maps will run on 5 of those nodes. Your LocalShuffledRowRDD uses the map output locations as the preferred locations, so why wouldn't the scheduler schedule on those nodes? Are you saying the 10 worker nodes (not sure if you mean executors or workers?) are being used by others (either jobs or stages), so some might be busy and the delay from waiting is more than just reading over the network? Is this case with dynamic allocation or not? It sounds to me like the normal case where you ran on some executors, you may not have the same executors when your reduce phase runs, so your scheduling is delayed because you can't get node locality.
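To make the per-taskset vs. per-task distinction concrete, here is a minimal, hypothetical Scala sketch of the current behavior described above. This is not the real `TaskSetManager` code; the names (`allowedLevel`, `levels`) and the single shared launch timestamp are simplifications introduced for illustration. The point is that one launch anywhere in the task set resets the clock for every task, which is what a per-task delay would change.

```scala
// Hypothetical, simplified model of Spark's delay scheduling
// (not the actual TaskSetManager implementation).
object DelaySchedulingSketch {
  // Locality levels from most to least local, as in Spark.
  val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

  /** Returns the index of the most relaxed level allowed. The key
   *  simplification matching today's behavior: `lastLaunchMs` is the
   *  last time ANY task in the set launched at the current level, so
   *  one launch resets the wait clock for the whole task set. */
  def allowedLevel(currentIdx: Int, lastLaunchMs: Long,
                   nowMs: Long, waitMs: Long): Int =
    if (nowMs - lastLaunchMs >= waitMs && currentIdx < levels.size - 1)
      currentIdx + 1
    else
      currentIdx

  def main(args: Array[String]): Unit = {
    // Some task launched 2s ago with a 3s wait: stay at NODE_LOCAL (index 1).
    assert(allowedLevel(1, lastLaunchMs = 1000, nowMs = 3000, waitMs = 3000) == 1)
    // 4s since any launch: fall through to RACK_LOCAL (index 2).
    assert(allowedLevel(1, lastLaunchMs = 0, nowMs = 4000, waitMs = 3000) == 2)
    println("ok")
  }
}
```

A per-task delay would instead track `lastLaunchMs` per pending task, so a single starved task could fall to the next level while the rest of the set keeps its locality.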
I think you could have the same thing with ShuffledRowRDD with a small number of maps/reducers. The issue I want to understand is why we are special-casing this one RDD for a performance improvement when, in my opinion, the majority of jobs would benefit from not having to wait for locality (as implemented today). Changing the default like Imran mentioned might be a good first step, and fixing the algorithm would be the second, in my opinion. Do you think a default of 0 for node locality wait would solve your problem? Obviously, if a user does set it, then it gets applied again.
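For reference, a user can already get the "0 node locality wait" behavior being proposed as a default with configuration alone. A hedged sketch, assuming the standard `SparkConf` API; the app name is a placeholder, and the two `spark.locality.wait*` keys are real Spark settings:

```scala
import org.apache.spark.SparkConf

// Keep the top-level locality wait at its default, but skip the
// node-level wait entirely, so tasks fall straight through to
// rack/any rather than waiting for a node-local slot.
val conf = new SparkConf()
  .setAppName("example")                 // placeholder name
  .set("spark.locality.wait", "3s")      // overall default per level
  .set("spark.locality.wait.node", "0")  // no wait at NODE_LOCAL
```

If the default itself were changed to 0, users who explicitly set `spark.locality.wait.node` would still get their configured wait applied, as noted above.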
