squito commented on issue #20640: [SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend
URL: https://github.com/apache/spark/pull/20640#issuecomment-537238923

My reluctance to merge this in the past is that, without SPARK-24567, it kinda seems like a step backwards to me. Before this change, if Mesos couldn't start a Mesos task (aka a Spark executor) on a node, we'd stop trying to place more executors there.

The problem with the existing code is that the node is rejected indefinitely, so I understand why you want to bring in the timeout. But you're only bringing in the timeout for cases where executors start successfully and tasks still consistently fail, e.g. if there is one bad disk out of 10. On the other hand, if there is something else fundamentally wrong with the node that prevents any executor from starting (e.g. missing libs), the behavior becomes worse, since those launch failures would no longer stop us from placing executors on that node.

I don't use Mesos at all myself, so I dunno whether Mesos somehow shields you from one type of problem, which would still make this worthwhile. But without that confidence, I didn't feel great about merging this.
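A toy model may make the trade-off above more concrete. The sketch below is hypothetical: the class names, the launch-failure threshold of 2, and the timeout value are illustrative assumptions, not Spark's actual `MesosCoarseGrainedSchedulerBackend` or blacklist code. It contrasts permanently rejecting a node after repeated executor-launch failures with a time-limited blacklist that only reacts to task failures.

```python
from dataclasses import dataclass, field


@dataclass
class PermanentLaunchRejection:
    """Hypothetical model of the old behavior: after max_failures executor
    launch failures on a node, never place executors there again."""
    max_failures: int = 2  # illustrative threshold, an assumption
    launch_failures: dict = field(default_factory=dict)

    def record_launch_failure(self, node, now):
        self.launch_failures[node] = self.launch_failures.get(node, 0) + 1

    def record_task_failures(self, node, now):
        pass  # task failures are not tracked by this policy

    def can_place_executor(self, node, now):
        return self.launch_failures.get(node, 0) < self.max_failures


@dataclass
class TimedTaskBlacklist:
    """Hypothetical model of the proposed behavior: a node is excluded only
    after task failures, and the exclusion expires after `timeout` time units.
    Executor launch failures are not counted at all."""
    timeout: float = 3600.0  # illustrative value, an assumption
    excluded_until: dict = field(default_factory=dict)

    def record_launch_failure(self, node, now):
        pass  # launch failures never exclude the node under this policy

    def record_task_failures(self, node, now):
        self.excluded_until[node] = now + self.timeout

    def can_place_executor(self, node, now):
        return now >= self.excluded_until.get(node, float("-inf"))


def launch_attempts(policy, node_is_broken, steps):
    """Count executor launches attempted on one node over `steps` scheduling
    rounds; on a broken node (e.g. missing libs) every launch fails."""
    attempts = 0
    for t in range(steps):
        if policy.can_place_executor("node-1", t):
            attempts += 1
            if node_is_broken:
                policy.record_launch_failure("node-1", t)
    return attempts


if __name__ == "__main__":
    print(launch_attempts(PermanentLaunchRejection(), True, 100))
    print(launch_attempts(TimedTaskBlacklist(), True, 100))
```

In this model, on a node where no executor can ever start, the first policy gives up after two attempts while the second keeps retrying every round, which is the regression the comment worries about. Conversely, only the second policy lets a node back in once its timeout expires, which is the benefit the PR is after.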
