brian wickman created AURORA-409:
------------------------------------
Summary: Executor exits with unacknowledged updates while the
slave is down, resulting in LOST tasks.
Key: AURORA-409
URL: https://issues.apache.org/jira/browse/AURORA-409
Project: Aurora
Issue Type: Bug
Components: Executor
Reporter: brian wickman
Originally filed by [~bmahler]
Currently, it appears as though Thermos will attempt to send status updates
while the slave is down. This is correct, as the executor driver will re-send
unacknowledged updates when the slave reconnects.
However, since Thermos does not wait for reregistered(), it's possible for
Thermos to exit before the slave reconnects and the driver flushes its
unacknowledged updates. The affected tasks are then reported as LOST.
To ensure updates reach the slave, Thermos must wait for reregistered()
before exiting if disconnected() was called. That is, between
disconnected() and reregistered(), Thermos must not send a status update and
then exit, if reliable status updates are desired.
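
The wait described above could be sketched roughly as follows. This is only
an illustration, not Thermos code: ReconnectGate, shutdown(),
send_final_update, stop_driver, and the timeout parameter are all
hypothetical names, and the sketch assumes the executor forwards its
disconnected() and reregistered() callbacks to the gate. Note that
reregistered() only signals that the driver can re-send; it is not itself an
acknowledgment.

```python
import threading


class ReconnectGate:
    """Hypothetical helper: tracks whether the slave is currently connected."""

    def __init__(self):
        self._connected = threading.Event()
        # Assumption: the gate is created once the executor has registered.
        self._connected.set()

    def disconnected(self):
        # Forward the executor's disconnected() callback here.
        self._connected.clear()

    def reregistered(self):
        # Forward the executor's reregistered() callback here.
        self._connected.set()

    def wait_until_connected(self, timeout=None):
        # Blocks while disconnected; returns False if the timeout elapsed.
        return self._connected.wait(timeout)


def shutdown(gate, send_final_update, stop_driver, timeout=None):
    """Send the terminal update, then hold the exit until reconnected."""
    send_final_update()  # the driver queues this if the slave is down
    reconnected = gate.wait_until_connected(timeout)
    stop_driver()
    return reconnected
```

Whether to bound the wait is a design choice the issue does not settle: with
timeout=None the executor never exits while disconnected, while a finite
timeout trades reliability for a guaranteed exit.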