Github user timout commented on the issue:
https://github.com/apache/spark/pull/17619
That does exactly what it is supposed to do. And you are absolutely right, it
is related to executors.
I am sorry if that was not clear from my previous explanations.
Let us say:
Spark Streaming app, a very long-running app:
The driver, started by Marathon from a Docker image, schedules (in the Mesos
sense) executors using Docker images (net=HOST). Every executor is started
from a Docker image on some Mesos agent.
So if some recoverable error happens, for instance:
ExecutorLostFailure (executor 40 exited caused by one of the running tasks)
Reason: Remote RPC client disassociated...
(I do not know about others, but it happens relatively often in my env.)
As a result the executor will be dead, and after 2 failures the Mesos agent node
will be added to the MesosCoarseGrainedSchedulerBackend blacklist and the driver
will never schedule (in the Mesos sense) an executor on it again. So the app will
starve... and, note, it will not die.
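To illustrate the starvation described above, here is a minimal Python sketch of a permanent per-agent blacklist. It is a toy model under stated assumptions, not the actual MesosCoarseGrainedSchedulerBackend code: the class and function names are hypothetical, and the threshold of 2 is taken only from the "after 2 failures" description above.

```python
from collections import defaultdict

# Assumption: threshold taken from the "after 2 failures" description above.
MAX_AGENT_FAILURES = 2


class AgentBlacklistModel:
    """Toy model of a permanent per-agent blacklist (hypothetical, not Spark code)."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.failures = defaultdict(int)  # agent id -> number of lost executors

    def record_executor_failure(self, agent):
        # Failures are only ever counted up; nothing resets or expires them.
        self.failures[agent] += 1

    def is_blacklisted(self, agent):
        return self.failures[agent] >= MAX_AGENT_FAILURES

    def schedulable_agents(self):
        # Agents on which the driver would still accept offers.
        return [a for a in self.agents if not self.is_blacklisted(a)]


def simulate(agents, failure_events):
    """Replay a sequence of executor failures and return the agents still usable."""
    model = AgentBlacklistModel(agents)
    for agent in failure_events:
        model.record_executor_failure(agent)
    return model.schedulable_agents()
```

For example, `simulate(["a1", "a2"], ["a1", "a1", "a2", "a2"])` leaves no schedulable agents: in this model a long-running app that sees two recoverable failures on every agent keeps running but can no longer obtain executors.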
That is exactly what happened with my streaming apps before that patch.
The patch may already be incompatible with master, but I can fix it if
needed.