Thank you, Adam. This is really helpful in validating some of the choices made in the Myriad HA design, which I'll submit for review shortly.

Regards
Swapnil

On Fri, Jul 31, 2015 at 3:07 AM, Adam Bordelon <[email protected]> wrote:

> 1) Mesos trusts custom executor to report task status.
> Correct, as long as the executor is still running.
>
> 2) Mesos does not use task status as a heartbeat.
> Correct. A task could start RUNNING, then provide no other status updates
> for months and Mesos will assume it is still running as long as there were
> no terminal status updates sent, the executor is still running, and the
> slave is still connected.
> However, you can optionally add health checks (HTTP or command) to your
> tasks, and Mesos will report the health back in periodic status updates.
> But it's up to your framework to determine how to interpret an "unhealthy"
> state.
>
> 3) If an executor dies, Mesos thinks all tasks launched by that executor
> are lost.
> Correct. However, there is a long-standing issue (MESOS-313) that
> executorLost is never actually passed onto the scheduler. You will get a
> TASK_LOST for each task though.
>
> On Fri, Jul 31, 2015 at 1:07 AM, Swapnil Daingade <[email protected]> wrote:
>
> > Hi All,
> >
> > I am looking to verify if my understanding of Task failures and executor
> > failures in Mesos is correct.
> >
> > I am assuming the following
> >
> > * Mesos trusts custom executor to report task status.
> > If a task completes/fails, but executor does not call
> > ExecutorDriver.sendStatusUpdate() with TASK_COMPLETE/TASK_FAILED then
> > Mesos will assume that the task is still running.
> >
> > * Mesos does not use task status sent using call to
> > ExecutorDriver.sendStatusUpdate as a heartbeat.
> > For E.g. in MyriadExecutor we report the NMTask status as TASK_RUNNING
> > after launching the NM. We report TASK_COMPLETE/TASK_FAILED only after
> > the process has terminated. There is no call to
> > ExecutorDriver.sendStatusUpdate() in between. I am assuming that this
> > does not cause Mesos to think that the task has been lost after some
> > timeout interval.
> >
> > * If an executor dies, Mesos thinks all tasks launched by that executor
> > are lost. Scheduler will receive one call to executorLost() and
> > statusUpdate()'s with state set to TASK_LOST for every Task launched by
> > that executor.
> >
> > Please let me know if any of my assumptions are incorrect.
> >
> > Regards
> > Swapnil
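
For illustration, the three points confirmed in this thread could be sketched on the scheduler side roughly as follows. This is a minimal, self-contained model, not Myriad or Mesos code: `TaskLedger`, its methods, and the local `State` enum are hypothetical names invented for the example, and the enum only mirrors a subset of Mesos task states (`TASK_FINISHED` is assumed to be the Mesos name for the state the thread calls TASK_COMPLETE). The idea is that a task stays active until a terminal status update arrives, with no timeout, and that recovery keys off the per-task TASK_LOST update rather than executorLost.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical scheduler-side bookkeeping; none of these types come from Mesos. */
final class TaskLedger {
    /** Local stand-in for a subset of Mesos task states (assumption, not Protos.TaskState). */
    enum State { TASK_STAGING, TASK_RUNNING, TASK_FINISHED, TASK_FAILED, TASK_KILLED, TASK_LOST }

    private static final Set<State> TERMINAL =
        EnumSet.of(State.TASK_FINISHED, State.TASK_FAILED, State.TASK_KILLED, State.TASK_LOST);

    private final Map<String, State> tasks = new LinkedHashMap<>();

    /** Record a newly launched task. */
    void launched(String taskId) {
        tasks.put(taskId, State.TASK_STAGING);
    }

    /**
     * Apply one status update. Returns true if this update moved the task
     * into a terminal state, i.e. the framework should now react to it.
     */
    boolean statusUpdate(String taskId, State state) {
        State current = tasks.get(taskId);
        if (current == null || TERMINAL.contains(current)) {
            return false; // unknown task, or already terminal: nothing to do
        }
        tasks.put(taskId, state);
        return TERMINAL.contains(state);
    }

    /** A task counts as active until a terminal update arrives, however long that takes. */
    boolean isActive(String taskId) {
        State current = tasks.get(taskId);
        return current != null && !TERMINAL.contains(current);
    }

    /** Deliberately a no-op: recovery is driven by per-task TASK_LOST, not by this callback. */
    void executorLost(String executorId) {
    }
}
```

In this sketch a framework would relaunch or clean up only when `statusUpdate` returns true, so it behaves the same whether or not an executorLost callback is ever delivered.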
