[ https://issues.apache.org/jira/browse/MESOS-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Raster updated MESOS-2334:
----------------------------------
    Attachment: testcase.tar.gz

Here is a testcase that should trigger the issue.

We also have a potential cause for the problem: the logging level. We set up 
our test cluster according to your tutorial, which had us put this line into 
our init scripts, setting the highest possible verbosity on the slaves:

GLOG_v=2

After running into a problem where a hard disk filled up because of the amount 
of logging, we changed that to:

GLOG_minloglevel=1

and that seems to have solved our issue. We now suspect that the problem is 
somehow related to I/O in general, maybe a race condition somewhere that only 
really shows up when the log level is high. We also considered the full hard 
disk as a possible cause, but I convinced the others that this is probably not 
it, because the problem also occurred in my local Vagrant VM without the disk 
filling up.
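
For reference, this is what the two settings mean in glog terms (the trailing 
comments are my annotations, not lines from our init scripts or the tutorial):

GLOG_v=2            # before: also emit VLOG() messages up to verbosity level 2
GLOG_minloglevel=1  # after: only log severity WARNING (1) and above, dropping INFO (0)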

To use the testcase, keep in mind that you have to change the hardcoded 
ZooKeeper URL.
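
For orientation, the core of what the testcase does looks roughly like the 
sketch below. This is a minimal example against the 0.21-era Python bindings, 
not the attached code itself; the ZooKeeper address, resource sizes and 
framework name are made up. It registers a scheduler against a hardcoded 
ZooKeeper master URL and, for every offer, launches a small CommandInfo task 
with launchTasks, as described in the issue below.

#!/usr/bin/env python
# Minimal sketch, NOT the attached testcase: a CommandInfo task launched via
# launchTasks using the Mesos 0.21-era Python bindings.
import uuid

import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2

# The hardcoded ZooKeeper URL -- this is the value you have to change.
MASTER = "zk://192.168.33.10:2181/mesos"  # hypothetical address

class SleepScheduler(mesos.interface.Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            task = mesos_pb2.TaskInfo()
            task.task_id.value = str(uuid.uuid4())
            task.slave_id.value = offer.slave_id.value
            task.name = task.task_id.value
            # The shell command from the issue description below.
            task.command.value = "sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar"

            cpus = task.resources.add()
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 16

            # One task per offer here; the real testcase launches many more.
            driver.launchTasks(offer.id, [task])

if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # let Mesos fill in the current user
    framework.name = "staging-testcase"  # made-up name
    mesos.native.MesosSchedulerDriver(SleepScheduler(), framework, MASTER).run()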

> Tasks get stuck in TASK_STAGING after a network decode error
> ------------------------------------------------------------
>
>                 Key: MESOS-2334
>                 URL: https://issues.apache.org/jira/browse/MESOS-2334
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0, 0.21.1
>            Reporter: Andreas Raster
>         Attachments: testcase.tar.gz
>
>
> We observed that a test case which schedules a large number of small CommandInfo
> tasks (shell commands like "sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar")
> on a cluster with launchTasks sometimes produces a single task that is launched
> and set to TASK_STAGING but never receives a TASK_RUNNING update (or any other
> update at all). It then stays in TASK_STAGING indefinitely until we kill the
> framework.
> We asked in #mesos on freenode about this and got an answer from alexr_:
> [15:56:55] <alexr_> henno: thanks for the slave logs
> [15:58:47] <alexr_> henno: it looks from the logs, that the slave 
> successfully registers the executor and sends the task
> [15:59:30] <alexr_> the executor, for some reason, refuses to start the task, 
> most probably because of the message decoding error
> This tells us that he suspects the reason is a network decoding error. I am
> currently not 100% sure what exactly he means by that, and since I was not the
> one talking to alexr_ on IRC, I cannot post the exact log section that indicates
> the decoding error. But I'll attach the logs we supplied to alexr_, so those
> should contain the relevant information.
> The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315
> cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
> >> http://paste.ubuntu.com/10160270/
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr
> >> http://paste.ubuntu.com/10160335/
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
> >> http://paste.ubuntu.com/10160346/
> If any relevant information is still missing, don't hesitate to ask me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
