[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephan Erb updated MESOS-5332:
-------------------------------

    Description:

When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:

{code}
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
{code}

We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using {{supervisorctl stop}}, the executors keep running (verified via {{ps aux}}); they are only killed once the restarted agent asks them to reregister.

The issue is hard to reproduce:
* When restarting the agent binary multiple times, tasks are only lost on the first restart.
* It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).

Mesos correctly honors the two-second wait before killing un-reregistered executors (see the condensed sketch of the agent-side cleanup after the issue header below). The affected executors receive the reconnect request, but apparently fail to deliver their reply.

A successful reregistration (not leading to LOST):

{code}
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
{code}

A failed one:

{code}
I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
{code}

All tasks ending up in LOST show output similar to the failed example above, i.e. the log messages appear in the wrong order. Does anyone have an idea what might be going on here?
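To illustrate where the reply might get lost: below is a heavily simplified sketch of the executor driver's handling of the reconnect request, modeled on {{ExecutorProcess::reconnect()}} in exec.cpp (names abridged; the checkpointed tasks and status updates carried by the real message are omitted). This is a sketch for discussion, not the actual source.

{code}
// Simplified sketch, not the actual exec.cpp source.
void ExecutorProcess::reconnect(const UPID& from, const SlaveID& slaveId)
{
  LOG(INFO) << "Received reconnect request from slave " << slaveId;

  // Point the driver at the restarted agent.
  slave = from;
  link(slave);

  // Reply so that the agent marks this executor as reregistered. If
  // libprocess were to reuse a stale cached socket to the *old* agent
  // instance here, this reply would be written to a dead connection and
  // silently lost, and the agent would kill the executor once its
  // two-second reregistration window closes.
  ReregisterExecutorMessage message;
  message.mutable_executor_id()->MergeFrom(executorId);
  message.mutable_framework_id()->MergeFrom(frameworkId);
  send(slave, message);
}
{code}

The {{Failed to shutdown socket with fd 11: Transport endpoint is not connected}} error in the failed log would be consistent with a send on such a dead socket.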
> TASK_LOST on slave restart potentially due to executor race condition
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5332
>                 URL: https://issues.apache.org/jira/browse/MESOS-5332
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, slave
>    Affects Versions: 0.26.0
>         Environment: Mesos 0.26
>                      Aurora 0.13
>            Reporter: Stephan Erb
>         Attachments: executor-logs.tar.gz, executor-stderr.log, executor-stderrV2.log, mesos-slave.log
>
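For context on the {{Cleaning up un-reregistered executors}} log lines: after recovery the agent waits two seconds (the hardcoded {{EXECUTOR_REREGISTER_TIMEOUT}} constant, assuming 0.26 behaves like the current sources) and then destroys every executor that has not answered the reconnect request. A condensed sketch, modeled on {{Slave::reregisterExecutorTimeout()}} in slave.cpp with error handling and state transitions elided:

{code}
// Condensed sketch, not the actual slave.cpp source.
void Slave::reregisterExecutorTimeout()
{
  LOG(INFO) << "Cleaning up un-reregistered executors";

  foreachvalue (Framework* framework, frameworks) {
    foreachvalue (Executor* executor, framework->executors) {
      // Executors still in REGISTERING state never answered the
      // reconnect request in time; their containers are destroyed and
      // their tasks are subsequently reported as TASK_LOST.
      if (executor->state == Executor::REGISTERING) {
        LOG(INFO) << "Killing un-reregistered executor '" << executor->id
                  << "' of framework " << framework->id;
        containerizer->destroy(executor->containerId);
      }
    }
  }
}
{code}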
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)