[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272952#comment-15272952 ]

Anand Mazumdar commented on MESOS-5332:
---------------------------------------

[~StephanErb] Thanks for reporting this issue after the initial discussion with 
Vinod on IRC. Before it's too late, would it be possible to upload the full 
executor logs for the failed scenario? I want to rule out the possibility that 
we somehow got duplicate {{ExitedEvent}}s for the agent process restart.
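
To make it concrete what I mean by duplicate {{ExitedEvent}}s: a hypothetical 
watcher like the sketch below (illustrative libprocess usage only, nothing from 
the executor driver) gets one {{exited()}} callback per {{ExitedEvent}} 
delivered for a linked PID, so two callbacks for a single agent restart would 
be exactly the duplication I want to rule out.

{code}
// Illustrative only: a libprocess actor that links to the agent PID and
// counts the ExitedEvents it receives.
#include <iostream>
#include <string>

#include <process/process.hpp>

class ExitWatcher : public process::Process<ExitWatcher>
{
public:
  explicit ExitWatcher(const process::UPID& agent) : agent_(agent) {}

protected:
  void initialize() override
  {
    link(agent_);  // Ask libprocess to deliver ExitedEvents for this PID.
  }

  void exited(const process::UPID& pid) override
  {
    // One invocation per ExitedEvent delivered to this actor.
    std::cout << "ExitedEvent #" << ++count_ << " from " << pid << std::endl;
  }

private:
  const process::UPID agent_;
  int count_ = 0;
};


int main(int argc, char** argv)
{
  // argv[1]: the agent's libprocess PID, e.g. slave(1)@10.X.X.X:5051.
  ExitWatcher watcher{process::UPID{std::string{argv[1]}}};
  process::spawn(&watcher);
  process::wait(watcher.self());
  return 0;
}
{code}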

It would also be good to have the full agent logs from when you saw the above 
occurrence for the failed executor. And on average, how large was your 
{{<work_dir>/meta/slaves/latest/frameworks/<framework-id>/executors}} directory 
when you were seeing this behavior?
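
In case it is easier than eyeballing the directory, here is a rough standalone 
sketch (std::filesystem, hypothetical helper, adjust the path to your actual 
{{work_dir}} and framework id) to count the checkpointed executors the agent 
has to walk during recovery:

{code}
// Rough sketch, not part of Mesos: count the checkpointed executor
// directories under the agent's meta directory. Pass the executors meta
// directory as the only argument, e.g.
//   <work_dir>/meta/slaves/latest/frameworks/<framework-id>/executors
#include <cstddef>
#include <filesystem>
#include <iostream>

int main(int argc, char** argv)
{
  namespace fs = std::filesystem;

  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <executors-meta-dir>" << std::endl;
    return 1;
  }

  const fs::path executors{argv[1]};

  std::size_t count = 0;
  for (const auto& entry : fs::directory_iterator(executors)) {
    if (entry.is_directory()) {
      ++count;  // One subdirectory per checkpointed executor.
    }
  }

  std::cout << count << " checkpointed executors under " << executors
            << std::endl;
  return 0;
}
{code}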

> TASK_LOST on slave restart potentially due to executor race condition
> ---------------------------------------------------------------------
>
>                 Key: MESOS-5332
>                 URL: https://issues.apache.org/jira/browse/MESOS-5332
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, slave
>    Affects Versions: 0.26.0
>         Environment: Mesos 0.26
> Aurora 0.13
>            Reporter: Stephan Erb
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent via {{supervisorctl stop}}, the 
> executors keep running (verified via {{ps aux}}). They are only killed once 
> the restarted agent tries to re-register them.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> long period of time (> 7 days).
> * It tends to be more likely if the host has many cores (30-40) and thus many 
> libprocess workers. 
> Mesos correctly sticks to the 2-second wait time before killing 
> un-reregistered executors. The failed executors receive the reconnect 
> request, but apparently never manage to send the re-register reply.
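> For what it's worth, a purely illustrative sketch of the deadline behaviour we 
> observe (not the actual agent code; I assume the 2 seconds are the default of 
> the {{--executor_reregistration_timeout}} agent flag):
> {code}
> // Illustration only: after restart the agent asks every checkpointed
> // executor to reconnect, waits a fixed period, and then kills whatever has
> // not re-registered in the meantime.
> #include <chrono>
> #include <iostream>
> #include <map>
> #include <string>
> #include <thread>
> 
> enum class State { RECONNECT_SENT, REREGISTERED };
> 
> int main()
> {
>   // Hypothetical executors; in the failed case the re-register reply never
>   // makes it back to the agent within the window.
>   std::map<std::string, State> executors = {
>       {"executor-that-replied", State::REREGISTERED},
>       {"executor-that-did-not", State::RECONNECT_SENT}};
> 
>   // At this point every executor has logged "Received reconnect request".
>   std::this_thread::sleep_for(std::chrono::seconds(2));
> 
>   for (const auto& [id, state] : executors) {
>     if (state != State::REREGISTERED) {
>       std::cout << "Killing un-reregistered executor '" << id << "'" << std::endl;
>     }
>   }
>   return 0;
> }
> {code}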
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up LOST have output similar to the one posted above, i.e. 
> the messages seem to be received in the wrong order.
> Does anyone have an idea what might be going on here?


