[
https://issues.apache.org/jira/browse/MESOS-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Raster updated MESOS-2334:
----------------------------------
Affects Version/s: 0.21.1
We tested this issue again with 0.21.1 and verified that it occurs in that
version as well.
The test case that triggers this issue randomly kills running tasks, and we
suspect that this killing might be a necessary condition for the issue to
occur, and maybe even its cause. So far we have never seen the issue happen in
test cases that do not kill tasks.
We are preparing a very simple test case (without our whole source tree as a
dependency) to isolate the issue and verify that it is indeed a bug on your
side and not on ours; a rough outline of what it will do is sketched below.
Once we manage to reproduce the issue with that simpler test case, we will
attach it here.
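For reference, here is a minimal sketch of what the isolated test case will
roughly do, written against the old Python bindings (mesos.interface /
mesos.native) that ship with 0.21.x: launch many small CommandInfo tasks and
randomly kill a fraction of the ones that reach TASK_RUNNING. The framework
name, task count, resource sizes and kill probability below are placeholders
of ours, not the values used in our real test case.

#!/usr/bin/env python
# Sketch: launch many small shell-command tasks and randomly kill some of
# them once they are reported as TASK_RUNNING.

import random
import sys
import uuid

import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2


class KillTestScheduler(mesos.interface.Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            tasks = []
            # Launch a handful of small tasks per offer; the resource sizes
            # are placeholders and we assume the offer is large enough.
            for _ in range(5):
                task = mesos_pb2.TaskInfo()
                task.task_id.value = str(uuid.uuid4())
                task.slave_id.value = offer.slave_id.value
                task.name = task.task_id.value
                task.command.value = \
                    "sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar"

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 0.1

                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 16

                tasks.append(task)

            driver.launchTasks(offer.id, tasks)

    def statusUpdate(self, driver, update):
        # Randomly kill a fraction of the tasks that reach TASK_RUNNING;
        # this killing is the part we suspect is needed to trigger the bug.
        if update.state == mesos_pb2.TASK_RUNNING and random.random() < 0.2:
            driver.killTask(update.task_id)


if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # let Mesos fill in the current user
    framework.name = "kill-test-framework"

    # sys.argv[1] is the master address, e.g. "host:5050".
    driver = mesos.native.MesosSchedulerDriver(
        KillTestScheduler(), framework, sys.argv[1])
    sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)

Whether this stripped-down version actually reproduces the stuck TASK_STAGING
state is exactly what we still need to confirm.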
> Tasks get stuck in TASK_STAGING after a network decode error
> ------------------------------------------------------------
>
> Key: MESOS-2334
> URL: https://issues.apache.org/jira/browse/MESOS-2334
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0, 0.21.1
> Reporter: Andreas Raster
>
> We observed that, with a test case that schedules a large number of small
> CommandInfo tasks (shell commands that look like this: "sleep `shuf -i 2-3 -n
> 1`; echo foo >> /share/bar") on a cluster via launchTasks, sometimes a single
> task that had been launched and set to TASK_STAGING would never receive a
> TASK_RUNNING message (or any other message at all). It would then stay in
> TASK_STAGING indefinitely until we killed the framework.
> We asked in #mesos on freenode about this and got an answer from alexr_:
> [15:56:55] <alexr_> henno: thanks for the slave logs
> [15:58:47] <alexr_> henno: it looks from the logs, that the slave
> successfully registers the executor and sends the task
> [15:59:30] <alexr_> the executor, for some reason, refuses to start the task,
> most probably because of the message decoding error
> So he suspects the reason is a network decoding error. I am currently not
> 100% sure what he means by that, and since I was not the one talking to
> alexr_ on IRC, I cannot post the exact log section that indicates the
> decoding error. I will attach the logs that we supplied to alexr_; those
> should contain the relevant information.
> The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315
> cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
> >> http://paste.ubuntu.com/10160270/
>
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr
> >> http://paste.ubuntu.com/10160335/
>
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
> >> http://paste.ubuntu.com/10160346/
> If any relevant information is still missing, don't hesitate to ask me.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)