[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated MESOS-5332:
-------------------------------
    Description: 
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 
20% to 50% of all tasks. They are killed by the Mesos agent via:

{code}
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-28854-0-6a88d62e-656
4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 
at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 
at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at 
executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
{code}


We have verified that the tasks and their executors are killed by the agent 
during startup. When stopping the agent using supervisorctl stop, the executors 
are still running (verified via {{ps aux}}). They are only killed once the 
agent tries to reregister.

The issue is hard to reproduce:

* When restarting the agent binary multiple times, tasks are only lost for the 
first restart.
* It is much more likely to occur if the agent binary has been running for a 
longer period of time (> 7 days)


Mesos is correctly sticking to the 2 seconds wait time before killing 
un-reregistered executors. The failed executors receive the reregistration 
request, but it seems like they fail to send a reply.

A successful reregistration (not leading to LOST):
{code}
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
{code}

A failed one:
{code}
I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 
11: Transport endpoint is not connected
I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
{code}

All task ending up in LOST have an output similar to the one posted above, i.e. 
log messages are in a wrong order.

Anyone an idea what might be going on here? 

  was:
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 
20% to 50% of all tasks. They are killed by the Mesos agent via:

{code}
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-28854-0-6a88d62e-656
4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 
at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 
at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at 
executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
{code}


We have verified that the tasks and their executors are killed by the agent 
during startup. When stopping the agent using supervisorctl stop, the executors 
are still running (verified via {{ps aux}}). They are only killed once the 
agent tries to reregister.

The issue is hard to reproduce:

* When restarting the agent binary multiple times, tasks are only lost for the 
first restart.
* It is much more likely to occur if the agent binary has been running for a 
longer period of time (> 7 days)
* It tends to be more likely if the host has many cores (30-40) and thus many 
libprocess workers. 


Mesos is correctly sticking to the 2 seconds wait time before killing 
un-reregistered executors. The failed executors receive the reregistration 
request, but it seems like they fail to send a reply.

A successful reregistration (not leading to LOST):
{code}
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
{code}

A failed one:
{code}
I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 
11: Transport endpoint is not connected
I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
{code}

All task ending up in LOST have an output similar to the one posted above, i.e. 
messages seem to be received in a wrong order.

Anyone an idea what might be going on here? 


> TASK_LOST on slave restart potentially due to executor race condition
> ---------------------------------------------------------------------
>
>                 Key: MESOS-5332
>                 URL: https://issues.apache.org/jira/browse/MESOS-5332
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, slave
>    Affects Versions: 0.26.0
>         Environment: Mesos 0.26
> Aurora 0.13
>            Reporter: Stephan Erb
>         Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All task ending up in LOST have an output similar to the one posted above, 
> i.e. log messages are in a wrong order.
> Anyone an idea what might be going on here? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to