Andreas Raster created MESOS-2334:
-------------------------------------
Summary: Tasks get stuck in TASK_STAGING after a network decode error
Key: MESOS-2334
URL: https://issues.apache.org/jira/browse/MESOS-2334
Project: Mesos
Issue Type: Bug
Reporter: Andreas Raster
In a test case that schedules a large number of small CommandInfo tasks with
launchTasks on a cluster (shell commands that look like this: "sleep `shuf -i
2-3 -n 1`; echo foo >> /share/bar"), we sometimes observed that a single task
that had been launched and set to TASK_STAGING never received a TASK_RUNNING
message (or any other message at all). It then stayed in TASK_STAGING
indefinitely, until we killed the framework.
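One way a framework could notice this condition earlier is a staging watchdog.
The following is a minimal sketch in plain Python and is not tied to the Mesos
API: the class StagingWatchdog, its method names and the 60 second timeout are
hypothetical and chosen only for illustration. It records when each task was
launched and reports the tasks that have not received any status update within
the timeout.

```python
import time


class StagingWatchdog:
    """Tracks launched tasks and reports those that never left staging.

    Hypothetical helper: a scheduler would call on_launch() when it launches
    a task and on_status_update() from its status update callback.
    """

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.staging_since = {}  # task ID -> timestamp of the launch

    def on_launch(self, task_id):
        # Remember when the task entered staging.
        self.staging_since[task_id] = self.clock()

    def on_status_update(self, task_id):
        # Any update (TASK_RUNNING, TASK_FAILED, ...) means the task is no
        # longer silently waiting in staging.
        self.staging_since.pop(task_id, None)

    def stuck_tasks(self):
        # Return the IDs of tasks without any update for at least the timeout.
        now = self.clock()
        return sorted(
            task_id
            for task_id, launched in self.staging_since.items()
            if now - launched >= self.timeout_seconds
        )
```

A framework could then log, kill or relaunch the reported tasks instead of
waiting indefinitely; which of these is appropriate depends on the framework.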
We asked in #mesos on freenode about this and got an answer from alexr_:
[15:56:55] <alexr_> henno: thanks for the slave logs
[15:57:09] rakete [[email protected]] has left
#mesos
[15:58:47] <alexr_> henno: it looks from the logs, that the slave successfully
registers the executor and sends the task
[15:59:07] tillt_ [[email protected]] has joined #mesos
[15:59:30] <alexr_> the executor, for some reason, refuses to start the task,
most probably because of the message decoding error
In other words, he suspects that the reason is a network decoding error. I am
currently not sure what exactly he means by that, and since I was not the one
talking to alexr_ on IRC, I cannot post the exact log section that indicates
the decoding error. I will attach the logs that we supplied to alexr_, though,
so those should contain the relevant information.
The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315
cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
>> http://paste.ubuntu.com/10160270/
cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr
>> http://paste.ubuntu.com/10160335/
cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
>> http://paste.ubuntu.com/10160346/
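For anyone who wants to collect the same files for another task, the commands
above can be sketched in Python. This is only an illustration: the directory
layout is taken from the paths shown above and may differ on other setups, and
the helper names sandbox_file and lines_mentioning are hypothetical.

```python
import posixpath


def sandbox_file(work_dir, slave_id, framework_id, executor_id, run_id, name):
    """Build the path of a sandbox file, following the layout shown above."""
    return posixpath.join(
        work_dir, "slaves", slave_id,
        "frameworks", framework_id,
        "executors", executor_id,
        "runs", run_id,
        name,  # for example "stderr" or "stdout"
    )


def lines_mentioning(lines, task_id):
    """Return the lines that contain the task ID, similar to grep."""
    return [line for line in lines if task_id in line]
```

For example, passing the lines of /var/log/mesos/mesos-slave.INFO and the task
name to lines_mentioning gives the same lines as the grep command above.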
If any relevant information is still missing, don't hesitate to ask me.