Github user lins05 commented on the issue:
https://github.com/apache/spark/pull/17750
> Do you then think it would be a viable option to enable it by default on
Coarse grained and have it not used in Fine-grained.
SGTM, especially considering fine-grained mode is already deprecated.
> Could you expand on this a bit more, I assume we could maintain the state
of the tasks similar to how driver state is maintained in
MesosClusterScheduler, and accordingly update state at crucial points, like
start.
I don't think it's an easy task at all, because the spark driver is not
designed to recover from crash.
The state in the MesosClusterScheduler is pretty simple. It's just a REST
server that accepts requests from clients and launches spark drivers on their
behalf. And it just need to persist its mesos framework id, because it need to
re-register with mesos master with the same framework id if it's restarted. In
the current implementation MesosClusterScheduler uses zookeeper as the persist
storage. Aside from that, the MesosClusterScheduler has no other stateful
information.
The spark driver is totally different, because it contains lots of stateful
information: the job/stage/task info, executors info, catalog that holds
temporary views, to name a few. And all those are kept in the driver's memory
and would be lost whenever the driver crashes. So it doesn't make sense to set
`failover_timeout` at all, because spark driver doesn't support fail over.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]