tgravescs commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-619031231
It might have been discussed on the PR or buried in comments; I don't have time to go looking. There are many different conditions to consider. The main one we were focusing on is when you have as many executors as you need to execute the tasks you have left. This means the allocation manager is not going to ask for more. The problem is that some or all of those executors can get blacklisted. The only way for the dynamic allocation manager to know it needs to ask for more is for it to know that nodes are blacklisted, so that it can internally increment its count of executors needed and ask YARN (or another resource manager) for more.

So now the number of executors the allocation manager is tracking is different from what its normal calculation would figure out. Simplified: `#executors = (#tasks * #cpus per task) / (#cores per executor)` (sketched below). So you have to change the number of executors, and you have to keep taking the blacklisting into account, because the allocation manager is continually recalculating whether it needs more or fewer executors. You also have to notify it when executors become unblacklisted, or when the blacklisted ones hit the idle timeout, etc. The allocation manager has to know a lot more details about the blacklisting and take them into account when it is calculating the number of executors it needs.
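To make that concrete, here is a minimal sketch in Scala of the simplified formula above and the blacklist adjustment it implies. This is illustrative only, not the actual `ExecutorAllocationManager` code; all names and signatures here are made up for the example.

```scala
// Minimal sketch (NOT Spark's actual ExecutorAllocationManager) of the
// calculation described above. All names are illustrative assumptions.
object AllocationSketch {

  /** The simplified target from the comment:
   *  #executors = (#tasks * #cpus per task) / (#cores per executor), rounded up.
   */
  def baseTarget(pendingTasks: Int, cpusPerTask: Int, coresPerExecutor: Int): Int =
    math.ceil(pendingTasks.toDouble * cpusPerTask / coresPerExecutor).toInt

  /** Hypothetical adjustment: if the manager knows how many of its current
   *  executors are blacklisted, it has to ask for that many extra, and it has
   *  to keep applying this on every recalculation, not just once.
   */
  def adjustedTarget(pendingTasks: Int, cpusPerTask: Int, coresPerExecutor: Int,
                     blacklistedExecutors: Int): Int =
    baseTarget(pendingTasks, cpusPerTask, coresPerExecutor) + blacklistedExecutors

  def main(args: Array[String]): Unit = {
    // 100 pending tasks, 1 cpu per task, 4 cores per executor -> base target 25.
    val base = baseTarget(100, 1, 4)
    // If 5 of those executors are blacklisted, the manager must ask for 30,
    // and later drop back to 25 once executors are unblacklisted or time out.
    val adjusted = adjustedTarget(100, 1, 4, blacklistedExecutors = 5)
    println(s"base=$base adjusted=$adjusted")
  }
}
```

The point of the adjustment parameter is exactly the coupling described above: the blacklist state has to feed into every recalculation of the target, in both directions (executors becoming blacklisted and becoming unblacklisted), or the manager's target drifts from what the cluster can actually run.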
