[
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032
]
Steven Schlansker edited comment on MESOS-1193 at 7/22/14 10:45 PM:
--------------------------------------------------------------------
That sounds entirely plausible.
We are running Docker containers. Given the somewhat unstable state of Docker
support in Mesos, we are using our own Docker launching scripts. I had just
updated a base image, so all the slaves were busy executing a 'docker pull' to
grab the new images.
Given that the task is a shell script that executes this pull, it may well be
past what Mesos thinks of as the "launch" phase. But it definitely was during
a lengthy initialization step.
It's worth noting that almost all of our jobs are Marathon tasks. I believe
the log messages about Chronos are unrelated; we only have one or two things
launching with it, and I don't think any were running around the time of the
crash.
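For reference, here is a minimal, hypothetical sketch of the failure mode as I
understand it. This is NOT actual Mesos code: only exec(), the promises map,
the containerId, and the check at mesos_containerizer.cpp:682 come from the
stack trace and log quoted below; every other name is assumed. It just models
cleanup removing a container's entry before a queued exec() call runs, which
would trip the fatal check.
{code:cpp}
// Hypothetical sketch, not actual Mesos code: a simplified model of the
// containerizer's 'promises' map and the check that fails at
// mesos_containerizer.cpp:682 when exec() runs for a container whose entry
// was already removed by cleanup.
#include <cstdlib>
#include <iostream>
#include <string>
#include <unordered_map>

// Stand-in for the per-container launch promise the containerizer keeps.
struct LaunchPromise {};

std::unordered_map<std::string, LaunchPromise> promises;

// Simplified exec(): the real method checks that the container is still
// registered before continuing the launch, and a failed CHECK aborts the
// slave ("Check failed: promises.contains(containerId)").
void exec(const std::string& containerId) {
  if (promises.count(containerId) == 0) {
    std::cerr << "Check failed: promises.contains(containerId)" << std::endl;
    std::abort();
  }
  std::cout << "exec: continuing launch of " << containerId << std::endl;
}

// Simplified cleanup path: when the executor exits, the container's
// promise entry goes away.
void destroy(const std::string& containerId) {
  promises.erase(containerId);
}

int main() {
  const std::string containerId = "6d4de71c-a491-4544-afe0-afcbfa37094a";
  promises.emplace(containerId, LaunchPromise{});

  // If the executor dies while the launch task is still busy (e.g. a long
  // 'docker pull' in our launch script), cleanup can run first...
  destroy(containerId);

  // ...and the queued exec() then finds no entry and aborts, matching the
  // fatal log line in the issue description.
  exec(containerId);
  return 0;
}
{code}
That ordering would be consistent with the log below, where the executor-exit
and cleanup messages appear immediately before the fatal check.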
> Check failed: promises.contains(containerId) crashes slave
> ----------------------------------------------------------
>
> Key: MESOS-1193
> URL: https://issues.apache.org/jira/browse/MESOS-1193
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 0.18.0
> Reporter: Tobi Knaup
>
> This was observed with four slaves on one machine, one framework (Marathon)
> and around 100 tasks per slave.
> I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for
> container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited
> I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535'
> of framework 201404041625-3823062160-55371-22555-0000 has terminated with
> signal Killed
> E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for
> executor web_467-1396634277535 of framework
> 201404041625-3823062160-55371-22555-0000: Not monitored
> I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update
> TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> from @0.0.0.0:0
> I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> to [email protected]:5050
> I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status
> update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor
> 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-0000/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a'
> for gc 6.99999644578667days in the future
> I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-0000/executors/web_467-1396634277535'
> for gc 6.99999644553185days in the future
> F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed:
> promises.contains(containerId)
> *** Check failure stack trace: ***
> @ 0x7f5209da6e5d google::LogMessage::Fail()
> @ 0x7f5209da8c9d google::LogMessage::SendToLog()
> @ 0x7f5209da6a4c google::LogMessage::Flush()
> @ 0x7f5209da9599 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f5209ad9f88 mesos::internal::slave::MesosContainerizerProcess::exec()
> @ 0x7f5209af3b56 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f5209cd0bf2 process::ProcessManager::resume()
> @ 0x7f5209cd0eec process::schedule()
> @ 0x7f5208b48f6e start_thread
> @ 0x7f52088739cd (unknown)