Re: Question about TASK_LOST statuses

Vinod Kone Tue, 11 Jun 2013 13:54:34 -0700

What version of mesos are you running? Some logs and command lines would be
great to debug here.



On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg <[email protected]> wrote:

> So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and other
> local processes on.
>
> When I test the code locally on the Mac (which has intermittent network
> connectivity), it runs fine for several minutes, then crashes OSX (the
> machine hardlocks), which is strange because I don't observe CPU or memory
> spikes, or network events (which I'm logging at 1s intervals).
>
> When I run the code on the cluster (which is Ubuntu Linux based), I still
> see a huge number of TASK_LOST messages, and the framework fails to have
> any tasks successfully run.
>
> What do you think the next steps are to debug this? Could it be a lossy
> network, or a misconfiguration of the slaves or the master or zookeeper?
>
> Thank you!
>
>
> On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler
> <[email protected]> wrote:
>
> > Yes, the Python bindings are still supported.
> >
> > Can you dump the DebugString of the TaskInfo you're constructing, to
> > confirm the SlaveID looks ok?
> >
> > Ben
> >
> >
> > On Tue, May 28, 2013 at 7:06 AM, David Greenberg
> > <[email protected]> wrote:
> >
> > > Sorry for the delayed response--I'm having some issues w/ email
> > > delivery to gmail...
> > >
> > > I'm trying to use the Python binding in this application. I am copying
> > > from offer.slave_id.value to task.slave_id.value using the = operator.
> > >
> > > Is the Python binding still supported? Either way, due to some new
> > > concurrency requirements, I'm going to be shifting gears into writing a
> > > JVM-based Mesos framework now.
> > >
> > > Thanks!
> > >
> > >
> > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
> > >
> > > > ---------- Forwarded message ----------
> > > > From: Vinod Kone <[email protected]>
> > > > Date: Sun, May 19, 2013 at 6:56 PM
> > > > Subject: Re: Question about TASK_LOST statuses
> > > > To: "[email protected]" <[email protected]>
> > > >
> > > >
> > > > > On the master's logs, I see this:
> > > > >
> > > > > - 5600+ instances of "Error validating task XXX: Task uses invalid
> > > > > slave: SOME_UUID"
> > > > >
> > > > > What do you think the problem is? I am copying the slave_id from the
> > > > > offer into the TaskInfo protobuf.
> > > > >
> > > > >
> > > > This will happen if the slave id in the task doesn't match the slave id
> > > > in the slave. Are you sure you are copying the right slave ids to the
> > > > right tasks? Looks like there is a mismatch. Maybe some logs/printfs in
> > > > your scheduler, when you launch tasks, can point out the issue.
> > > >
> > > >
> > > >
> > > > > I'm using the process-based isolation at the moment (I haven't had the
> > > > > time to set up the cgroups isolation yet).
> > > > >
> > > > > I can find and share whatever else is needed so that we can figure out
> > > > > why these messages are occurring.
> > > > >
> > > > > Thanks,
> > > > > David
> > > > >
> > > > >
> > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > You are right in that all these status updates are what we call
> > > > > > "terminal" status updates and mesos takes specific actions when it
> > > > > > gets/generates one of these.
> > > > > >
> > > > > > TASK_LOST is special in the sense that it is not generated by the
> > > > > > executor, but by the slave/master. You could think of it as an
> > > > > > exception in mesos. Clearly, these should be rare in a stable mesos
> > > > > > system.
> > > > > >
> > > > > > What do your logs say about the TASK_LOSTs? Is it always the same
> > > > > > issue? Are you running w/ cgroups?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hello! Today I began working on a more advanced version of
> > > > > > > mesos-submit that will handle hot-spares.
> > > > > > >
> > > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the
> > > > > > > status updates that meant that I needed to start a new spare
> > > > > > > process, as the monitored task was killed. However, I noticed that I
> > > > > > > often received TASK_LOSTs, and every 5 seconds, my scheduler would
> > > > > > > think its tasks had all died, so it'd restart too many. Nevertheless,
> > > > > > > the tasks would reappear later on, and I could see them in the web
> > > > > > > interface of Mesos, continuing to run.
> > > > > > >
> > > > > > > What is going on?
> > > > > > >
> > > > > > > Thanks!
> > > > > > > David
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
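For reference, the "logs/printfs on your scheduler" check suggested in the
thread could look roughly like the sketch below. It is only an illustration:
Offer and Task here are plain Python stand-ins (assumptions, not the Mesos
protobuf classes), and describe_launch and mismatched_tasks are hypothetical
helper names. The idea is to compare each task's slave id with the slave id of
the offer it is launched against, just before launching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Stand-in for a resource offer (assumption, not the real protobuf)."""
    offer_id: str
    slave_id: str


@dataclass(frozen=True)
class Task:
    """Stand-in for a TaskInfo (assumption, not the real protobuf)."""
    task_id: str
    slave_id: str


def mismatched_tasks(offer, tasks):
    """Return the tasks whose slave id differs from the offer's slave id."""
    return [t for t in tasks if t.slave_id != offer.slave_id]


def describe_launch(offer, tasks):
    """Build one log line per task, flagging slave id mismatches."""
    lines = []
    for t in tasks:
        status = "ok" if t.slave_id == offer.slave_id else "MISMATCH"
        lines.append(
            f"offer={offer.offer_id} task={t.task_id} "
            f"task.slave_id={t.slave_id} offer.slave_id={offer.slave_id} "
            f"{status}"
        )
    return lines


if __name__ == "__main__":
    offer = Offer(offer_id="o1", slave_id="slave-A")
    tasks = [Task("t1", "slave-A"), Task("t2", "slave-B")]
    for line in describe_launch(offer, tasks):
        print(line)
```

Any task flagged MISMATCH would be a candidate for the "Task uses invalid
slave" error seen in the master's logs.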

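On the original hot-spare question, one way to avoid restarting too many
tasks is to act on a terminal update at most once per task id. The sketch
below assumes status updates arrive as (task id, state name) pairs; the
SpareTracker class and its method names are hypothetical and not part of the
Mesos API. The state names are the ones mentioned in the thread plus
TASK_RUNNING.

```python
# Terminal states named in the thread: TASK_{FAILED,FINISHED,LOST,KILLED}.
TERMINAL_STATES = frozenset(
    {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}
)


class SpareTracker:
    """Hypothetical bookkeeping: request one replacement per task, once."""

    def __init__(self):
        self.replaced = set()   # task ids already given a replacement
        self.lost_counts = {}   # task id -> number of TASK_LOST updates seen

    def on_status_update(self, task_id, state):
        """Return True if the caller should start a replacement task."""
        if state == "TASK_LOST":
            # Count TASK_LOSTs so a flood of them shows up in diagnostics.
            self.lost_counts[task_id] = self.lost_counts.get(task_id, 0) + 1
        if state not in TERMINAL_STATES:
            return False
        if task_id in self.replaced:
            return False  # a replacement was already requested
        self.replaced.add(task_id)
        return True


if __name__ == "__main__":
    tracker = SpareTracker()
    for update in [("t1", "TASK_RUNNING"), ("t1", "TASK_LOST"),
                   ("t1", "TASK_LOST")]:
        print(update, tracker.on_status_update(*update))
```

This only limits duplicate restarts; it does not explain why the TASK_LOST
updates are being generated in the first place.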