Yes, the Python bindings are still supported. Can you dump the DebugString of the TaskInfo you're constructing, to confirm the SlaveID looks ok?
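For reference, a minimal sketch of that check with the Python bindings (assuming the stock mesos_pb2 module; in Python, the text form you get from str(message) is the equivalent of C++'s DebugString()):

    import mesos_pb2

    def build_task(offer, task_id_value):
        """Build a TaskInfo whose SlaveID comes from the offer it runs on."""
        task = mesos_pb2.TaskInfo()
        task.task_id.value = task_id_value
        task.name = "task %s" % task_id_value
        # Copy the whole SlaveID message from the *same* offer the task is
        # launched against; a mismatch produces "Task uses invalid slave".
        task.slave_id.CopyFrom(offer.slave_id)
        return task

    # Before driver.launchTasks(offer.id, [task]), log the text form:
    #     print str(task)  # Python analogue of DebugString()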
Ben

On Tue, May 28, 2013 at 7:06 AM, David Greenberg <[email protected]> wrote:

> Sorry for the delayed response--I'm having some issues w/ email delivery
> to gmail...
>
> I'm trying to use the Python binding in this application. I am copying
> from offer.slave_id.value to task.slave_id.value using the = operator.
>
> Is the Python binding still supported? Either way, due to some new
> concurrency requirements, I'm going to be shifting gears into writing a
> JVM-based Mesos framework now.
>
> Thanks!
>
> On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
>
> > ---------- Forwarded message ----------
> > From: Vinod Kone <[email protected]>
> > Date: Sun, May 19, 2013 at 6:56 PM
> > Subject: Re: Question about TASK_LOST statuses
> > To: "[email protected]" <[email protected]>
> >
> > > On the master's logs, I see this:
> > >
> > > - 5600+ instances of "Error validating task XXX: Task uses invalid
> > > slave: SOME_UUID"
> > >
> > > What do you think the problem is? I am copying the slave_id from the
> > > offer into the TaskInfo protobuf.
> >
> > This will happen if the slave id in the task doesn't match the slave id
> > in the slave. Are you sure you are copying the right slave ids to the
> > right tasks? It looks like there is a mismatch. Maybe some logs/printfs
> > in your scheduler, when you launch tasks, can point out the issue.
> >
> > > I'm using the process-based isolation at the moment (I haven't had
> > > the time to set up the cgroups isolation yet).
> > >
> > > I can find and share whatever else is needed so that we can figure
> > > out why these messages are occurring.
> > >
> > > Thanks,
> > > David
> > >
> > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > >
> > > > Hi David,
> > > >
> > > > You are right in that all these status updates are what we call
> > > > "terminal" status updates, and Mesos takes specific actions when it
> > > > gets/generates one of these.
> > > >
> > > > TASK_LOST is special in the sense that it is not generated by the
> > > > executor, but by the slave/master. You could think of it as an
> > > > exception in Mesos. Clearly, these should be rare in a stable Mesos
> > > > system.
> > > >
> > > > What do your logs say about the TASK_LOSTs? Is it always the same
> > > > issue? Are you running w/ cgroups?
> > > >
> > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg
> > > > <[email protected]> wrote:
> > > >
> > > > > Hello! Today I began working on a more advanced version of
> > > > > mesos-submit that will handle hot spares.
> > > > >
> > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the
> > > > > status updates that meant I needed to start a new spare process,
> > > > > as the monitored task was killed. However, I noticed that I often
> > > > > received TASK_LOSTs, and every 5 seconds my scheduler would think
> > > > > its tasks had all died, so it would restart too many.
> > > > > Nevertheless, the tasks would reappear later on, and I could see
> > > > > them in the web interface of Mesos, continuing to run.
> > > > >
> > > > > What is going on?
> > > > >
> > > > > Thanks!
> > > > > David
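For the hot-spare restarts discussed above, a rough sketch of the terminal-state handling with the Python bindings (the Scheduler interface is the standard one from the mesos module; restart_spare is a hypothetical placeholder for the hot-spare logic):

    import mesos
    import mesos_pb2

    TERMINAL_STATES = (
        mesos_pb2.TASK_FINISHED,
        mesos_pb2.TASK_FAILED,
        mesos_pb2.TASK_KILLED,
        mesos_pb2.TASK_LOST,
    )

    class HotSpareScheduler(mesos.Scheduler):
        def statusUpdate(self, driver, update):
            # TASK_LOST comes from the slave/master rather than the
            # executor, so log the message before treating it as a death.
            if update.state in TERMINAL_STATES:
                print "Task %s reached state %s: %s" % (
                    update.task_id.value, update.state, update.message)
                restart_spare(update.task_id)  # hypothetical helper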
