brian wickman created AURORA-409:
------------------------------------

             Summary: Executor exits with unacknowledged updates while the 
slave is down, resulting in LOST tasks.
                 Key: AURORA-409
                 URL: https://issues.apache.org/jira/browse/AURORA-409
             Project: Aurora
          Issue Type: Bug
          Components: Executor
            Reporter: brian wickman


Originally filed by [~bmahler]

Currently, it appears as though Thermos will attempt to send status updates 
while the slave is down. This is correct, as the executor driver will re-send 
unacknowledged updates when the slave reconnects.

However, since Thermos does not wait for reregistered(), it's possible for 
Thermos to exit before the slave reconnects and the driver flushes its 
unacknowledged updates.

To ensure updates reach the slave, Thermos must wait for reregistered() 
before exiting if disconnected() was called. That is, if reliable status 
updates are desired, Thermos must not send status updates and then exit 
in the window between disconnected() and reregistered().
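
A minimal sketch of the proposed behavior, not the actual Thermos code. 
ConnectionGate, shutdown() and the stop_driver callable are hypothetical 
names introduced for illustration; only disconnected() and reregistered() 
correspond to the executor callbacks mentioned above. The sketch assumes 
the executor forwards those two callbacks to the gate, and that blocking 
until reregistered() gives the driver the chance to re-send its 
unacknowledged updates.

```python
import threading


class ConnectionGate:
    """Hypothetical helper that tracks whether the slave is connected."""

    def __init__(self):
        self._connected = threading.Event()
        # Assumption: the executor has already registered with the slave.
        self._connected.set()

    def disconnected(self):
        # Forwarded from the executor's disconnected() callback.
        self._connected.clear()

    def reregistered(self):
        # Forwarded from the executor's reregistered() callback.
        self._connected.set()

    def wait_until_connected(self, timeout=None):
        # Returns True if connected, False if the timeout expired first.
        return self._connected.wait(timeout)


def shutdown(gate, stop_driver, timeout=None):
    """Call stop_driver only while the slave connection is up.

    Returns False without stopping if the slave has not re-registered
    within the timeout, so the caller can decide whether to keep waiting.
    """
    if not gate.wait_until_connected(timeout):
        return False
    stop_driver()
    return True
```

With timeout=None the call blocks until reregistered() fires. Whether a 
bounded wait is preferable is a policy choice, since a slave that never 
returns would otherwise keep the executor alive indefinitely.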



--
This message was sent by Atlassian JIRA
(v6.2#6252)
