Sure, sorry I didn't post the link; I'm on a restricted network at work
that blocks file-upload sites. Here it is:
https://www.dropbox.com/s/bhapvvq6kznlgyz/master_and_slave_logs.tar.bz2

Currently, I'm trying to set up Hadoop and Spark on Mesos for ad-hoc data
analysis tasks. I've also written a fluent Clojure library for working with
Mesos, which I intend to use to build a new scheduler for a specific
problem at work on our 700-machine cluster. Some of the Clojure work will
be open source (EPL) once I've written better documentation and actually
had a chance to test it.

Thanks!


On Wed, Jun 12, 2013 at 6:28 PM, Benjamin Mahler
<[email protected]> wrote:

> Can you link to the logs?
>
> Can you give us a little background about how you're using mesos? If you're
> using it for production jobs, I would recommend 0.12.0 once it's released,
> as it has been vetted in production (at Twitter, at least). We've also
> included instructions on how to upgrade from 0.11.0 to 0.12.0 on a running
> cluster.
>
>
> On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg
> <[email protected]> wrote:
>
> > I am on 0.12 right now, git revision
> > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same one airbnb/chronos
> > recommends).
> >
> > The master and slave logs are 1.7 MB bzip2'ed, but apache.org's mailer
> > doesn't accept messages that large. I've sent them directly to Vinod,
> > and I can send them to anyone else who asks.
> >
> > I'm just running mesos with --conf, and the config is
> >
> > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > log_dir = /data/scratch/local/mesos/logs
> > work_dir = /data/scratch/local/mesos/work
> >
> >
> > I would be happy to move to the latest version that's likely stable, but
> > even after reading all of the discussion over the past couple of weeks on
> > 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
> > HEAD, or some other commit.
> >
> > Thank you!
> >
> >
> > On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg
> > <[email protected]> wrote:
> >
> > > I am on 0.12 right now, git revision
> > > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same one airbnb/chronos
> > > recommends).
> > >
> > > I've attached the master and slave logs. I'm just running mesos with
> > > --conf, and the config is
> > >
> > > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > log_dir = /data/scratch/local/mesos/logs
> > > work_dir = /data/scratch/local/mesos/work
> > >
> > >
> > > I would be happy to move to the latest version that's likely stable, but
> > > even after reading all of the discussion over the past couple of weeks on
> > > 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
> > > HEAD, or some other commit.
> > >
> > > Thank you!
> > >
> > > On Tue, Jun 11, 2013 at 4:53 PM, Vinod Kone <[email protected]>
> > > wrote:
> > >
> > >> What version of mesos are you running? Some logs and command lines
> > >> would be great for debugging this.
> > >>
> > >>
> > >> On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg
> > >> <[email protected]> wrote:
> > >>
> > >> > So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and
> > >> > other local processes on.
> > >> >
> > >> > When I test the code locally on the Mac (which has intermittent
> > >> > network connectivity), it runs fine for several minutes, then
> > >> > crashes OS X (the machine hard-locks), which is strange because I
> > >> > don't observe CPU or memory spikes, or network events (which I'm
> > >> > logging at 1s intervals).
> > >> >
> > >> > When I run the code on the cluster (which is Ubuntu Linux based),
> > >> > I still see a huge number of TASK_LOST messages, and the framework
> > >> > fails to get any tasks to run successfully.
> > >> >
> > >> > What do you think the next steps are to debug this? Could it be a
> > >> > lossy network, or a misconfiguration of the slaves, the master, or
> > >> > zookeeper?
> > >> >
> > >> > Thank you!
> > >> >
> > >> >
> > >> > On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler
> > >> > <[email protected]>wrote:
> > >> >
> > >> > > Yes, the Python bindings are still supported.
> > >> > >
> > >> > > Can you dump the DebugString of the TaskInfo you're constructing,
> > >> > > to confirm the SlaveID looks OK?
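> > >> > >
> > >> > > In the Python bindings there is no DebugString() method on the
> > >> > > message object; printing the protobuf emits the same text-format
> > >> > > dump. A minimal sketch, assuming task is the TaskInfo you build:
> > >> > >
> > >> > >     # printing a Python protobuf message renders it in text
> > >> > >     # format, which includes the slave_id field
> > >> > >     print task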
> > >> > >
> > >> > > Ben
> > >> > >
> > >> > >
> > >> > > On Tue, May 28, 2013 at 7:06 AM, David Greenberg
> > >> > > <[email protected]> wrote:
> > >> > >
> > >> > > > Sorry for the delayed response; I'm having some issues with
> > >> > > > email delivery to gmail...
> > >> > > >
> > >> > > > I'm trying to use the Python binding in this application. I am
> > >> > > > copying from offer.slave_id.value to task.slave_id.value using
> > >> > > > the = operator.
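> > >> > > >
> > >> > > > Concretely, the copy looks roughly like this (a sketch; the
> > >> > > > task id and surrounding context are made up for illustration):
> > >> > > >
> > >> > > >     task = mesos_pb2.TaskInfo()
> > >> > > >     task.task_id.value = "task-1"  # hypothetical id
> > >> > > >     # field-by-field copy via the = operator ...
> > >> > > >     task.slave_id.value = offer.slave_id.value
> > >> > > >     # ... or copy the whole SlaveID message at once:
> > >> > > >     # task.slave_id.CopyFrom(offer.slave_id)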
> > >> > > >
> > >> > > > Is the Python binding still supported? Either way, due to some
> > >> > > > new concurrency requirements, I'm going to shift gears and
> > >> > > > write a JVM-based Mesos framework now.
> > >> > > >
> > >> > > > Thanks!
> > >> > > >
> > >> > > >
> > >> > > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone
> > >> > > > <[email protected]> wrote:
> > >> > > >
> > >> > > > > ---------- Forwarded message ----------
> > >> > > > > From: Vinod Kone <[email protected]>
> > >> > > > > Date: Sun, May 19, 2013 at 6:56 PM
> > >> > > > > Subject: Re: Question about TASK_LOST statuses
> > >> > > > > To: "[email protected]"
> > >> > > > > <[email protected]>
> > >> > > > >
> > >> > > > > In the master's logs, I see this:
> > >> > > > >
> > >> > > > > > - 5600+ instances of "Error validating task XXX: Task uses
> > >> > > > > > invalid slave: SOME_UUID"
> > >> > > > > >
> > >> > > > > > What do you think the problem is? I am copying the slave_id
> > >> > > > > > from the offer into the TaskInfo protobuf.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > This will happen if the slave id in the task doesn't match
> > >> > > > > the slave id in the slave. Are you sure you are copying the
> > >> > > > > right slave ids to the right tasks? It looks like there is a
> > >> > > > > mismatch. Some logs/printfs in your scheduler, where you
> > >> > > > > launch tasks, might point out the issue.
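> > >> > > > >
> > >> > > > > Something along these lines (a sketch for the Python
> > >> > > > > bindings; make_task is a hypothetical helper):
> > >> > > > >
> > >> > > > >     def resourceOffers(self, driver, offers):
> > >> > > > >         for offer in offers:
> > >> > > > >             task = self.make_task(offer)
> > >> > > > >             # a mismatch printed here reproduces the master's
> > >> > > > >             # "Task uses invalid slave" error
> > >> > > > >             print "offer slave %s -> task slave %s" % (
> > >> > > > >                 offer.slave_id.value, task.slave_id.value)
> > >> > > > >             driver.launchTasks(offer.id, [task])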
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > > I'm using the process-based isolation at the moment (I
> > >> > > > > > haven't had the time to set up the cgroups isolation yet).
> > >> > > > > >
> > >> > > > > > I can find and share whatever else is needed so that we can
> > >> > > > > > figure out why these messages are occurring.
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > David
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone
> > >> > > > > > <[email protected]> wrote:
> > >> > > > > >
> > >> > > > > > > Hi David,
> > >> > > > > > >
> > >> > > > > > > You are right in that all of these status updates are
> > >> > > > > > > what we call "terminal" status updates, and mesos takes
> > >> > > > > > > specific actions when it gets/generates one of these.
> > >> > > > > > >
> > >> > > > > > > TASK_LOST is special in the sense that it is not generated
> > >> > > > > > > by the executor, but by the slave/master. You could think
> > >> > > > > > > of it as an exception in mesos. Clearly, these should be
> > >> > > > > > > rare in a stable mesos system.
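> > >> > > > > > >
> > >> > > > > > > Scheduler code usually handles the terminal states
> > >> > > > > > > together, along these lines (a sketch for the Python
> > >> > > > > > > bindings; restart_task is a hypothetical helper):
> > >> > > > > > >
> > >> > > > > > >     TERMINAL = (mesos_pb2.TASK_FINISHED,
> > >> > > > > > >                 mesos_pb2.TASK_FAILED,
> > >> > > > > > >                 mesos_pb2.TASK_KILLED,
> > >> > > > > > >                 mesos_pb2.TASK_LOST)
> > >> > > > > > >
> > >> > > > > > >     def statusUpdate(self, driver, update):
> > >> > > > > > >         # only terminal updates should trigger a respawn
> > >> > > > > > >         if update.state in TERMINAL:
> > >> > > > > > >             self.restart_task(update.task_id.value)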
> > >> > > > > > >
> > >> > > > > > > What do your logs say about the TASK_LOSTs? Is it always
> > >> > > > > > > the same issue? Are you running with cgroups?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg
> > >> > > > > > > <[email protected]> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello! Today I began working on a more advanced version
> > >> > > > > > > > of mesos-submit that will handle hot spares.
> > >> > > > > > > >
> > >> > > > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED}
> > >> > > > > > > > were the status updates that meant I needed to start a
> > >> > > > > > > > new spare process, as the monitored task was killed.
> > >> > > > > > > > However, I noticed that I often received TASK_LOSTs, and
> > >> > > > > > > > every 5 seconds my scheduler would think all of its
> > >> > > > > > > > tasks had died, so it would restart too many.
> > >> > > > > > > > Nevertheless, the tasks would reappear later on, and I
> > >> > > > > > > > could see them in the Mesos web interface, continuing
> > >> > > > > > > > to run.
> > >> > > > > > > >
> > >> > > > > > > > What is going on?
> > >> > > > > > > >
> > >> > > > > > > > Thanks!
> > >> > > > > > > > David
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
