[
https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291648#comment-15291648
]
Joseph Wu commented on MESOS-5395:
----------------------------------
Nothing in the mesos logs indicates that your task is *not* starting:
From the stdout file, the task you're looking at is
{code}
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e
{code}
The agent logs say that the task started successfully. These timestamps line up
very closely with the task's stderr.
{code}
I0518 14:55:19.393923 947 slave.cpp:1361] Got assigned task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.394619 947 gc.cpp:83] Unscheduling
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
from gc
I0518 14:55:19.394680 947 gc.cpp:83] Unscheduling
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
from gc
I0518 14:55:19.394760 947 slave.cpp:1480] Launching task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.395539 947 paths.cpp:528] Trying to chown
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
to user 'root'
I0518 14:55:19.399237 947 slave.cpp:5367] Launching executor
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 with resources cpus(*):0.1;
mem(*):32 in work directory
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
I0518 14:55:19.399588 947 slave.cpp:1698] Queuing task
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' for
executor
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.402344 948 docker.cpp:1036] Starting container
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' for task
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' (and
executor
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e') of
framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
...
I0518 14:55:26.880151 952 docker.cpp:623] Checkpointing pid 6331 to
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c/pids/forked.pid'
I0518 14:55:26.907119 952 slave.cpp:2643] Got registration for executor
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from
executor(1)@10.254.234.236:42289
I0518 14:55:26.907639 952 docker.cpp:1316] Ignoring updating container
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' with resources passed to update is
identical to existing resources
I0518 14:55:26.907726 952 slave.cpp:1863] Sending queued task
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' to
executor
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 at
executor(1)@10.254.234.236:42289
I0518 14:55:27.622561 952 slave.cpp:3002] Handling status update TASK_RUNNING
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from
executor(1)@10.254.234.236:42289
I0518 14:55:27.622762 953 status_update_manager.cpp:320] Received status
update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.622974 953 status_update_manager.cpp:824] Checkpointing UPDATE
for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for
task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.679003 953 slave.cpp:3400] Forwarding the update TASK_RUNNING
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to
[email protected]:5050
I0518 14:55:27.679095 953 slave.cpp:3310] Sending acknowledgement for status
update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to
executor(1)@10.254.234.236:42289
I0518 14:55:27.691797 950 status_update_manager.cpp:392] Received status
update acknowledgement (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.691839 950 status_update_manager.cpp:824] Checkpointing ACK
for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for
task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
{code}
Right above this is presumably Marathon's previous attempt at starting your
task.
{code}
I0518 11:56:24.553864 947 docker.cpp:1036] Starting container
'a2227cc9-79aa-417c-8189-a260e8b57b2b' for task
'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' (and
executor
'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e') of
framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
I0518 12:01:24.554524 948 slave.cpp:4322] Terminating executor
''project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' of
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' because it did not
register within 5mins
I0518 12:01:24.554687 948 docker.cpp:1696] Destroying container
'a2227cc9-79aa-417c-8189-a260e8b57b2b'
I0518 12:01:24.554694 948 docker.cpp:1739] Destroying Container
'a2227cc9-79aa-417c-8189-a260e8b57b2b' in PULLING state
{code}
By the looks of it, either your Docker image is very large (i.e. it cannot be
reliably pulled within 5 minutes, or whatever the agent's
{{--executor_registration_timeout}} flag is set to), or that agent was
partitioned from the Docker registry you are using.
If your image(s) are very large, consider increasing the value of the
{{--executor_registration_timeout}} flag.
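To check how close a launch came to the timeout, you can diff the timestamps of
the relevant agent log lines. Below is a minimal Python sketch, not part of
Mesos: the helper names {{glog_time}} and {{elapsed}} are made up for
illustration, the log lines are shortened copies of the ones quoted above, and
the year is an assumption because the log prefix only carries month and day.

```python
import re
from datetime import datetime, timedelta

# Log prefix as it appears above: severity letter, MMDD, HH:MM:SS.microseconds.
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def glog_time(line, year=2016):
    """Parse the timestamp at the start of a log line.

    The prefix has no year, so `year` is an assumption supplied by the caller.
    """
    match = GLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a glog-style line: %r" % line)
    stamp = "%d%s %s" % (year, match.group(1), match.group(2))
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def elapsed(start_line, end_line):
    """Return the time between two log lines as a timedelta."""
    return glog_time(end_line) - glog_time(start_line)


# Earlier attempt: container start until the executor was terminated.
failed = elapsed(
    "I0518 11:56:24.553864 947 docker.cpp:1036] Starting container",
    "I0518 12:01:24.554524 948 slave.cpp:4322] Terminating executor",
)

# Later attempt: container start until the executor registered.
succeeded = elapsed(
    "I0518 14:55:19.402344 948 docker.cpp:1036] Starting container",
    "I0518 14:55:26.907119 952 slave.cpp:2643] Got registration for executor",
)

print("failed attempt:    ", failed)      # 0:05:00.000660
print("successful attempt:", succeeded)   # 0:00:07.504775
```

The earlier attempt was terminated almost exactly 5 minutes after the container
start, while the later one registered in about 7.5 seconds. That fits an image
that was already cached on the agent the second time around, although the logs
alone do not prove it.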
> Task getting stuck in staging state if launch it on a rebooted slave.
> ---------------------------------------------------------------------
>
> Key: MESOS-5395
> URL: https://issues.apache.org/jira/browse/MESOS-5395
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.28.0
> Environment: mesos/marathon cluster, 3 masters/4 slaves
> Mesos: 0.28.0 , Marathon 0.15.2
> Reporter: Mengkui gong
> Attachments: mesos-log.zip
>
>
> After rebooting a slave, tasks launched through Marathon start on the other
> slaves without problems, but a task launched on the rebooted slave gets stuck.
> The Mesos UI shows it in the staging state in the active tasks list, and the
> Marathon UI shows it in the deploying state. It can stay stuck for more than
> 2 hours. After that time, Marathon automatically launches the task on the
> rebooted slave or on another slave as normal, so the rebooted slave also
> recovers after that time.
> From the Mesos log, I can see "telling slave to kill task" all the time.
> I0517 15:25:27.207237 20568 master.cpp:3826] Telling slave
> 282745ab-423a-4350-a449-3e8cdfccfb93-S1 at slave(1)@10.254.234.236:5050
> (mesos-slave-3) to kill task
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 (marathon) at
> [email protected]:56757.
> From the rebooted slave's log, I can see:
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: I0517 15:28:37.206831
> 916 slave.cpp:1891] Asked to kill task
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: W0517 15:28:37.206866
> 916 slave.cpp:2018] Ignoring kill task
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e because
> the executor
> 'project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e' of
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 is terminating/terminated.