Re: Question about TASK_LOST statuses

Benjamin Mahler Wed, 12 Jun 2013 15:30:35 -0700

Can you link to the logs?

Can you give us a little background about how you're using mesos? If you're
using it for production jobs, I would recommend 0.12.0 once released as it
has been vetted in production (at Twitter at least). We've also included
instructions on how to upgrade from 0.11.0 to 0.12.0 on a running cluster.



On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg <[email protected]> wrote:

> I am on 0.12 right now, git revision
> 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
> recommends).
>
> The master and slave logs are 1.7MB bz2'ed, but apache.org's mailer
> doesn't accept such large messages. I've sent them directly to Vinod, and I
> can send them to anyone else who asks.
>
> I'm just running mesos w/ --conf, and the config is
>
> master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,
> iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
> iadv5.pit.mycompany.com:2181/mesos
> zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,
> iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
> iadv5.pit.mycompany.com:2181/mesos
> log_dir = /data/scratch/local/mesos/logs
> work_dir = /data/scratch/local/mesos/work
>
>
> I would be happy to move to the latest version that's likely stable, but
> even after reading all of the discussion over the past couple weeks on
> 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
> HEAD, or some other commit.
>
> Thank you!
>
>
> On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]
> >wrote:
>
> > I am on 0.12 right now, git revision
> > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
> recommends).
> >
> > I've attached the master and slave logs. I'm just running mesos w/
> --conf,
> > and the config is
> >
> > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,
> > iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
> > iadv5.pit.mycompany.com:2181/mesos
> > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,
> > iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
> > iadv5.pit.mycompany.com:2181/mesos
> > log_dir = /data/scratch/local/mesos/logs
> > work_dir = /data/scratch/local/mesos/work
> >
> >
> > I would be happy to move to the latest version that's likely stable, but
> > even after reading all of the discussion over the past couple weeks on
> > 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
> > HEAD, or some other commit.
> >
> > Thank you!
> >
> > On Tue, Jun 11, 2013 at 4:53 PM, Vinod Kone <[email protected]> wrote:
> >
> >> What version of mesos are you running? Some logs and command lines would
> >> be
> >> great to debug here.
> >>
> >>
> >> On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg <
> [email protected]
> >> >wrote:
> >>
> >> > So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and
> other
> >> > local processes on.
> >> >
> >> > When I test the code locally on the Mac (which has intermittent
> network
> >> > connectivity), it runs fine for several minutes, then crashes OSX (the
> >> > machine hardlocks), which is strange because I don't observe CPU or
> >> memory
> >> > spikes, or network events (which I'm logging at 1s intervals).
> >> >
> >> > When I run the code on the cluster (which is Ubuntu Linux based), I
> >> still
> >> > see a huge number of TASK_LOST messages, and the framework fails to
> have
> >> > any tasks successfully run.
> >> >
> >> > What do you think the next steps are to debug this? Could it be a
> lossy
> >> > network, or a misconfiguration of the slaves or the master or
> zookeeper?
> >> >
> >> > Thank you!
> >> >
> >> >
> >> > On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler
> >> > <[email protected]>wrote:
> >> >
> >> > > Yes, the Python bindings are still supported.
> >> > >
> >> > > Can you dump the DebugString of the TaskInfo you're constructing, to
> >> > > confirm the SlaveID looks ok?
> >> > >
> >> > > Ben
> >> > >
> >> > >
> >> > > On Tue, May 28, 2013 at 7:06 AM, David Greenberg <
> >> [email protected]
> >> > > >wrote:
> >> > >
> >> > > > Sorry for the delayed response--I'm having some issues w/ email
> >> > delivery
> >> > > to
> >> > > > gmail...
> >> > > >
> >> > > > I'm trying to use the Python binding in this application. I am
> >> copying
> >> > > from
> >> > > > offer.slave_id.value to task.slave_id.value using the = operator.
> >> > > >
> >> > > > Is the python binding still supported? Either way, due to some new
> >> > > > concurrency requirements, I'm going to be shifting gears into
> >> writing a
> >> > > > JVM-based Mesos framework now.
> >> > > >
> >> > > > Thanks!
> >> > > >
> >> > > >
> >> > > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]>
> >> > wrote:
> >> > > >
> >> > > > > ---------- Forwarded message ----------
> >> > > > > From: Vinod Kone <[email protected]>
> >> > > > > Date: Sun, May 19, 2013 at 6:56 PM
> >> > > > > Subject: Re: Question about TASK_LOST statuses
> >> > > > > To: "[email protected]" <
> >> [email protected]
> >> > >
> >> > > > >
> >> > > > >
> >> > > > > On the master's logs, I see this:
> >> > > > >
> >> > > > > > - 5600+ instances of "Error validating task XXX: Task uses
> >> invalid
> >> > > > slave:
> >> > > > > > SOME_UUID"
> >> > > > > >
> >> > > > > What do you think the problem is? I am copying the slave_id from
> >> the
> >> > > > offer
> >> > > > > > into the TaskInfo protobuf.
> >> > > > > >
> >> > > > > >
> >> > > > > This will happen if the slave id in the task doesn't match the
> >> slave
> >> > id
> >> > > > in
> >> > > > > the slave. Are you sure you are copying the right
> slave
> >> ids
> >> > > to
> >> > > > > the right tasks? Looks like there is a mismatch. Maybe some
> >> > > logs/printfs
> >> > > > on
> >> > > > > your scheduler, when you launch tasks, can point out the issue.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > > I'm using the process-based isolation at the moment (I haven't
> >> had
> >> > > the
> >> > > > > time
> >> > > > > > to set up the cgroups isolation yet).
> >> > > > > >
> >> > > > > > I can find and share whatever else is needed so that we can
> >> figure
> >> > > out
> >> > > > > why
> >> > > > > > these messages are occurring.
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > David
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <
> >> [email protected]>
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Hi David,
> >> > > > > > >
> >> > > > > > > You are right in that all these status updates are what we
> >> call
> >> > > > > > "terminal"
> >> > > > > > > status updates and mesos takes specific actions when it
> >> > > > gets/generates
> >> > > > > > one
> >> > > > > > > of these.
> >> > > > > > >
> >> > > > > > > TASK_LOST is special in the sense that it is not generated by
> the
> >> > > > > executor,
> >> > > > > > > but by the slave/master. You could think of it as an
> >> exception in
> >> > > > > mesos.
> >> > > > > > > Clearly, these should be rare in a stable mesos system.
> >> > > > > > >
> >> > > > > > > What do your logs say about the TASK_LOSTs? Is it always the
> >> same
> >> > > > > issue?
> >> > > > > > > Are you running w/ cgroups?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <
> >> > > > > [email protected]
> >> > > > > > > >wrote:
> >> > > > > > >
> >> > > > > > > > Hello! Today I began working on a more advanced version of
> >> > > > > mesos-submit
> >> > > > > > > > that will handle hot-spares.
> >> > > > > > > >
> >> > > > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED}
> were
> >> the
> >> > > > > status
> >> > > > > > > > updates that meant that I needed to start a new spare
> >> process,
> >> > as
> >> > > > the
> >> > > > > > > > monitored task was killed. However, I noticed that I often
> >> > > received
> >> > > > > > > > TASK_LOSTs, and every 5 seconds, my scheduler would think
> >> its
> >> > > tasks
> >> > > > > had
> >> > > > > > > all
> >> > > > > > > > died, so it'd restart too many. Nevertheless, the tasks
> >> would
> >> > > > > reappear
> >> > > > > > > > later on, and I could see them in the web interface of
> >> Mesos,
> >> > > > > > continuing
> >> > > > > > > to
> >> > > > > > > > run.
> >> > > > > > > >
> >> > > > > > > > What is going on?
> >> > > > > > > >
> >> > > > > > > > Thanks!
> >> > > > > > > > David
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
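
For reference, the slave ID check discussed in the thread (copying the slave ID from the offer into the task, then logging at launch time to spot a mismatch) can be sketched in Python. This is only a minimal illustration under stated assumptions, not the Mesos bindings themselves: SlaveID, Offer, and TaskInfo below are hypothetical stand-ins for the generated protobuf classes, and build_task, check_launch, and is_terminal are names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the Mesos protobuf messages. The real Python
# bindings generate their own Offer and TaskInfo classes; these only mimic
# the nested slave_id.value field mentioned in the thread.
@dataclass
class SlaveID:
    value: str = ""


@dataclass
class Offer:
    id: str
    slave_id: SlaveID


@dataclass
class TaskInfo:
    task_id: str
    slave_id: SlaveID = field(default_factory=SlaveID)


# The four states the thread describes as "terminal" status updates.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def is_terminal(state: str) -> bool:
    """Return True if the task state name is one of the terminal states."""
    return state in TERMINAL_STATES


def build_task(task_id: str, offer: Offer) -> TaskInfo:
    """Build a task bound to the slave that made the given offer."""
    task = TaskInfo(task_id=task_id)
    # Copy the scalar field, as described in the thread.
    task.slave_id.value = offer.slave_id.value
    return task


def check_launch(task: TaskInfo, offer: Offer) -> Optional[str]:
    """Return a description of the mismatch, or None if the IDs agree."""
    if task.slave_id.value != offer.slave_id.value:
        return "task %s uses slave %r but offer %s came from slave %r" % (
            task.task_id, task.slave_id.value, offer.id, offer.slave_id.value)
    return None
```

A scheduler could call check_launch on each (task, offer) pair just before launching and log any non-None result. That is one way to get the kind of launch-time logging suggested above without waiting for the master to reject the task.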
