Re: Question about TASK_LOST statuses

David Greenberg Tue, 28 May 2013 07:07:00 -0700

Sorry for the delayed response--I'm having some issues w/ email delivery to
gmail...


I'm trying to use the Python binding in this application. I am copying from
offer.slave_id.value to task.slave_id.value using the = operator.
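
For reference, a minimal sketch of that copy, with a log line per launch so a
mismatch would show up on the scheduler side. The classes are hypothetical
stand-ins that mirror only the shape of the Mesos messages (they are not the
generated protobuf classes), and make_task and launch_plan are names made up
for illustration. With the real binding, the scalar assignment to
slave_id.value is the part that carries over; assigning a whole message to
task.slave_id is not allowed for protobuf composite fields.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins mirroring only the shape of the Mesos messages;
# a real framework would use the generated protobuf classes instead.
@dataclass
class SlaveID:
    value: str = ""


@dataclass
class Offer:
    offer_id: str
    slave_id: SlaveID


@dataclass
class TaskInfo:
    task_id: str
    slave_id: SlaveID = field(default_factory=SlaveID)


def make_task(task_id, offer):
    """Build a task bound to the slave that made this offer."""
    task = TaskInfo(task_id=task_id)
    # Copy the scalar value, not the message object, so the task keeps its
    # own SlaveID and is unaffected if the offer object is reused later.
    task.slave_id.value = offer.slave_id.value
    return task


def launch_plan(offers, task_ids):
    """Pair each pending task id with one offer and log the pairing."""
    plan = []
    for offer, task_id in zip(offers, task_ids):
        task = make_task(task_id, offer)
        print("task %s -> offer %s on slave %s"
              % (task.task_id, offer.offer_id, task.slave_id.value))
        plan.append((offer.offer_id, task))
    return plan
```

Each (offer_id, task) pair in the plan is meant to be launched against that
same offer; launching a task against a different offer than the one its
slave id came from is one way to end up with a mismatch.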

Is the python binding still supported? Either way, due to some new
concurrency requirements, I'm going to be shifting gears into writing a
JVM-based Mesos framework now.

Thanks!


On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:

> ---------- Forwarded message ----------
> From: Vinod Kone <[email protected]>
> Date: Sun, May 19, 2013 at 6:56 PM
> Subject: Re: Question about TASK_LOST statuses
> To: "[email protected]" <[email protected]>
>
>
> > On the master's logs, I see this:
> >
> > - 5600+ instances of "Error validating task XXX: Task uses invalid slave:
> > SOME_UUID"
> >
> > What do you think the problem is? I am copying the slave_id from the offer
> > into the TaskInfo protobuf.
> >
>
> This will happen if the slave id in the task doesn't match the slave id in
> the slave. Are you sure you are copying the right slave ids to
> the right tasks? Looks like there is a mismatch. Maybe some logs/printfs on
> your scheduler, when you launch tasks, can point out the issue.
>
>
>
> > I'm using the process-based isolation at the moment (I haven't had the
> > time to set up the cgroups isolation yet).
> >
> > I can find and share whatever else is needed so that we can figure out
> > why these messages are occurring.
> >
> > Thanks,
> > David
> >
> >
> > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> >
> > > Hi David,
> > >
> > > You are right in that all these status updates are what we call
> > > "terminal" status updates and mesos takes specific actions when it
> > > gets/generates one of these.
> > >
> > > TASK_LOST is special in the sense that it is not generated by the
> > > executor, but by the slave/master. You could think of it as an exception
> > > in mesos. Clearly, these should be rare in a stable mesos system.
> > >
> > > What do your logs say about the TASK_LOSTs? Is it always the same
> > > issue? Are you running w/ cgroups?
> > >
> > >
> > >
> > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> > >
> > > > Hello! Today I began working on a more advanced version of
> > > > mesos-submit that will handle hot-spares.
> > > >
> > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > > > updates that meant that I needed to start a new spare process, as the
> > > > monitored task was killed. However, I noticed that I often received
> > > > TASK_LOSTs, and every 5 seconds, my scheduler would think its tasks had
> > > > all died, so it'd restart too many. Nevertheless, the tasks would
> > > > reappear later on, and I could see them in the web interface of Mesos,
> > > > continuing to run.
> > > >
> > > > What is going on?
> > > >
> > > > Thanks!
> > > > David
> > > >
> > >
> >
>
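
The terminal-status handling discussed in the thread can be sketched as
follows. This is only an illustration under stated assumptions: states are
plain strings rather than the enum constants a real binding exposes, and
SpareTracker with its methods is a made-up helper, not part of Mesos. It
starts at most one replacement per task id and counts TASK_LOST updates
separately, so repeated TASK_LOSTs for the same task do not trigger a restart
storm and are easy to spot.

```python
# The four states named in the thread as "terminal" status updates.
TERMINAL_STATES = frozenset(
    ["TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"])


class SpareTracker:
    """Hypothetical helper: decides when a hot spare should be started."""

    def __init__(self):
        self.running = set()    # task ids believed to be alive
        self.lost_counts = {}   # task id -> number of TASK_LOST updates seen

    def task_launched(self, task_id):
        self.running.add(task_id)

    def status_update(self, task_id, state):
        """Return True if a replacement should be started for task_id."""
        if state == "TASK_LOST":
            # Count every TASK_LOST, even duplicates, for later inspection.
            self.lost_counts[task_id] = self.lost_counts.get(task_id, 0) + 1
        if state not in TERMINAL_STATES:
            return False
        if task_id not in self.running:
            # Unknown task, or a terminal update that was already handled.
            return False
        self.running.discard(task_id)
        return True
```

A scheduler's status-update callback would call status_update and launch a
spare only when it returns True; a lost_counts entry that keeps growing
suggests the launches themselves are being rejected, as in the "invalid
slave" errors above, rather than running tasks dying.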
