> I am having difficulty implementing a "no parallel execution" guarantee:
> if a worker (or the connection to it) goes down, I need to recognize this in
> the Coordinator, "pause" all jobs that worker was running, and (after some
> timeout or user action) resubmit the jobs to another worker. The timeout (or
> user action) is required to allow the worker (if it is alive) to detect the
> network error, stop its jobs, and start the cycle again (try to register
> itself with the Coordinator, etc.). It is important that once a connection is
> deemed broken, it is never reused (otherwise the worker may not notice the
> problem); the worker is treated as dead until it re-registers itself (after
> a job purge or restart).

gRPC doesn't have these sorts of intrinsics.

The interesting part here smells like a variation on distributed locking.
You may want to look at something like ZooKeeper.

You could use gRPC messages to do things like communicate the lock names.
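
To make the lock/lease idea concrete, here is a rough sketch of the bookkeeping the Coordinator would need. It is an in-memory illustration only, not a gRPC or ZooKeeper API: the names Coordinator, Session, register, heartbeat, assign, expire, lease_seconds, and grace_seconds are all hypothetical, and it assumes workers send periodic heartbeats that carry a session id handed out at registration. In a real system the session table would live in something like ZooKeeper, and gRPC would only carry the messages.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Session:
    worker: str
    session_id: int
    last_seen: float
    broken: bool = False
    jobs: set = field(default_factory=set)


class Coordinator:
    """Hypothetical in-memory lease table (illustration only)."""

    def __init__(self, lease_seconds, grace_seconds):
        self.lease_seconds = lease_seconds  # max silence before a session is broken
        self.grace_seconds = grace_seconds  # extra wait before jobs may move
        self._ids = itertools.count(1)
        self.sessions = {}  # session_id -> Session
        self.owner = {}     # job -> session_id currently running it
        self.paused = {}    # job -> earliest time it may be resubmitted

    def register(self, worker, now):
        # A new registration permanently retires the worker's older sessions.
        for s in self.sessions.values():
            if s.worker == worker and not s.broken:
                self._break(s, now)
        sid = next(self._ids)
        self.sessions[sid] = Session(worker, sid, now)
        return sid

    def heartbeat(self, session_id, now):
        # False tells the worker to stop its jobs and re-register.
        s = self.sessions.get(session_id)
        if s is None or s.broken:
            return False
        if now - s.last_seen > self.lease_seconds:
            self._break(s, now)
            return False
        s.last_seen = now
        return True

    def assign(self, job, session_id, now):
        s = self.sessions.get(session_id)
        if s is None or s.broken or now - s.last_seen > self.lease_seconds:
            return False
        if job in self.owner:                # already running on a live session
            return False
        if self.paused.get(job, now) > now:  # grace period has not elapsed yet
            return False
        self.paused.pop(job, None)
        self.owner[job] = session_id
        s.jobs.add(job)
        return True

    def expire(self, now):
        # Call periodically to break sessions whose lease has run out.
        for s in self.sessions.values():
            if not s.broken and now - s.last_seen > self.lease_seconds:
                self._break(s, now)

    def resubmittable(self, now):
        return sorted(j for j, t in self.paused.items() if now >= t)

    def _break(self, s, now):
        # A broken session is never revived; its jobs are paused, not moved.
        s.broken = True
        for job in s.jobs:
            self.owner.pop(job, None)
            self.paused[job] = now + self.grace_seconds
        s.jobs.clear()
```

The guarantee only holds if the worker enforces its side too: it has to stop its jobs on its own once it has gone longer than the lease without a successful heartbeat, and the grace period has to be long enough to cover that plus clock drift. Those timing values are assumptions you would need to tune.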

Christopher Warrington

