Ok, I'll try to do one thing at a time here. The first thing I'm seeing is that you have an executor terminating:

  I0611 20:19:58.519618 48373 process_based_isolation_module.cpp:344] Telling slave of lost executor cc54e5a4-ca40-444b-9286-72212bf012b5 of framework 201305261216-3261142444-5050-56457-0006

This is fine. (We've actually changed this message since 0.12.0 to say "terminated" as opposed to "lost".) However, this executor was running tasks! As a result, the slave considers those tasks lost and sends the appropriate status updates for them:

  I0611 20:19:58.519785 48401 slave.cpp:1065] Executor 'cc54e5a4-ca40-444b-9286-72212bf012b5' of framework 201305261216-3261142444-5050-56457-0006 has exited with status 0
  I0611 20:19:58.525691 48401 slave.cpp:842] Status update: task cc54e5a4-ca40-444b-9286-72212bf012b5 of framework 201305261216-3261142444-5050-56457-0006 is now in state TASK_LOST

Since I see an exit status of 0, I'm assuming this is a clean shutdown of a custom executor that you've written? If so, you'll need to send terminal updates (e.g. TASK_FINISHED) for the tasks you're running prior to shutting down the executor. Otherwise, the slave will consider all tasks running on the executor as LOST.

Does that clear anything up?
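For concreteness, here's a rough sketch of what that shutdown path could look like with the Python bindings. This is illustrative only: the `running` dict and the overall bookkeeping are assumptions, not taken from your executor.

    import mesos
    import mesos_pb2

    class MyExecutor(mesos.Executor):
        def __init__(self):
            # Tasks launched but not yet finished (illustrative bookkeeping).
            self.running = {}

        def launchTask(self, driver, task):
            self.running[task.task_id.value] = task
            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_RUNNING
            driver.sendStatusUpdate(update)
            # ... do the actual work for this task ...

        def shutdown(self, driver):
            # Send a terminal update for every task we still own *before*
            # the executor process exits. Without this, the slave reports
            # each of these tasks as TASK_LOST.
            for task_id in list(self.running):
                update = mesos_pb2.TaskStatus()
                update.task_id.value = task_id
                update.state = mesos_pb2.TASK_FINISHED
                driver.sendStatusUpdate(update)
                del self.running[task_id]

The same idea applies whatever binding you're using: every task that reached TASK_RUNNING needs a terminal update from the executor before it exits cleanly.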
On Wed, Jun 12, 2013 at 4:39 PM, David Greenberg <[email protected]> wrote:

> Sure, sorry I didn't post the link--I'm on a restricted network at work
> that blocks uploading sites. Here it is:
> https://www.dropbox.com/s/bhapvvq6kznlgyz/master_and_slave_logs.tar.bz2
>
> Currently, I'm trying to set up Hadoop and Spark on Mesos for ad-hoc data
> analysis tasks. I also wrote a Clojure fluent library for working with
> Mesos, which I intend to use to build a new scheduler for a specific
> problem at work on our 700-machine cluster. Some of the Clojure work will
> be open source (EPL) once I've written better documentation and actually
> had an opportunity to test it.
>
> Thanks!
>
> On Wed, Jun 12, 2013 at 6:28 PM, Benjamin Mahler <[email protected]> wrote:
>
> > Can you link to the logs?
> >
> > Can you give us a little background about how you're using Mesos? If
> > you're using it for production jobs, I would recommend 0.12.0 once
> > released, as it has been vetted in production (at Twitter, at least).
> > We've also included instructions on how to upgrade from 0.11.0 to 0.12.0
> > on a running cluster.
> >
> > On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg <[email protected]> wrote:
> >
> > > I am on 0.12 right now, git revision
> > > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same as airbnb/chronos
> > > recommends).
> > >
> > > The master and slave logs are 1.7MB bz2'ed, but apache.org's mailer
> > > doesn't accept such large messages. I've sent them directly to Vinod,
> > > and I can send them to anyone else who asks.
> > >
> > > I'm just running mesos w/ --conf, and the config is
> > >
> > > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > log_dir = /data/scratch/local/mesos/logs
> > > work_dir = /data/scratch/local/mesos/work
> > >
> > > I would be happy to move to the latest version that's likely stable,
> > > but even after reading all of the discussion over the past couple of
> > > weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick
> > > one of those, HEAD, or some other commit.
> > >
> > > Thank you!
> > >
> > > On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]> wrote:
> > >
> > > > I am on 0.12 right now, git revision
> > > > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same as airbnb/chronos
> > > > recommends).
> > > >
> > > > I've attached the master and slave logs. I'm just running mesos w/
> > > > --conf, and the config is
> > > >
> > > > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > > log_dir = /data/scratch/local/mesos/logs
> > > > work_dir = /data/scratch/local/mesos/work
> > > >
> > > > I would be happy to move to the latest version that's likely stable,
> > > > but even after reading all of the discussion over the past couple of
> > > > weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick
> > > > one of those, HEAD, or some other commit.
> > > >
> > > > Thank you!
> > > >
> > > > On Tue, Jun 11, 2013 at 4:53 PM, Vinod Kone <[email protected]> wrote:
> > > >
> > > > > What version of mesos are you running? Some logs and command lines
> > > > > would be great to debug here.
> > > > >
> > > > > On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg <[email protected]> wrote:
> > > > >
> > > > > > So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and
> > > > > > other local processes on.
> > > > > >
> > > > > > When I test the code locally on the Mac (which has intermittent
> > > > > > network connectivity), it runs fine for several minutes, then
> > > > > > crashes OS X (the machine hardlocks), which is strange because I
> > > > > > don't observe CPU or memory spikes, or network events (which I'm
> > > > > > logging at 1s intervals).
> > > > > >
> > > > > > When I run the code on the cluster (which is Ubuntu Linux based),
> > > > > > I still see a huge number of TASK_LOST messages, and the
> > > > > > framework fails to have any tasks successfully run.
> > > > > >
> > > > > > What do you think the next steps are to debug this? Could it be a
> > > > > > lossy network, or a misconfiguration of the slaves, the master,
> > > > > > or zookeeper?
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler <[email protected]> wrote:
> > > > > >
> > > > > > > Yes, the Python bindings are still supported.
> > > > > > >
> > > > > > > Can you dump the DebugString of the TaskInfo you're
> > > > > > > constructing, to confirm the SlaveID looks ok?
> > > > > > >
> > > > > > > Ben
> > > > > > >
> > > > > > > On Tue, May 28, 2013 at 7:06 AM, David Greenberg <[email protected]> wrote:
> > > > > > >
> > > > > > > > Sorry for the delayed response--I'm having some issues w/
> > > > > > > > email delivery to gmail...
> > > > > > > >
> > > > > > > > I'm trying to use the Python binding in this application. I
> > > > > > > > am copying from offer.slave_id.value to task.slave_id.value
> > > > > > > > using the = operator.
> > > > > > > >
> > > > > > > > Is the Python binding still supported?
> > > > > > > > Either way, due to some new concurrency requirements, I'm
> > > > > > > > going to be shifting gears into writing a JVM-based Mesos
> > > > > > > > framework now.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > ---------- Forwarded message ----------
> > > > > > > > > From: Vinod Kone <[email protected]>
> > > > > > > > > Date: Sun, May 19, 2013 at 6:56 PM
> > > > > > > > > Subject: Re: Question about TASK_LOST statuses
> > > > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > > >
> > > > > > > > > > On the master's logs, I see this:
> > > > > > > > > >
> > > > > > > > > > - 5600+ instances of "Error validating task XXX: Task
> > > > > > > > > >   uses invalid slave: SOME_UUID"
> > > > > > > > > >
> > > > > > > > > > What do you think the problem is? I am copying the
> > > > > > > > > > slave_id from the offer into the TaskInfo protobuf.
> > > > > > > > >
> > > > > > > > > This will happen if the slave id in the task doesn't match
> > > > > > > > > the slave id in the slave. Are you sure you are copying the
> > > > > > > > > right slave ids to the right tasks? It looks like there is
> > > > > > > > > a mismatch. Maybe some logs/printfs on your scheduler, when
> > > > > > > > > you launch tasks, can point out the issue.
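As an aside on the slave id copy discussed here: with the Python bindings, that copy (plus the DebugString dump suggested earlier in the thread) would look roughly like the sketch below. The task id and name are made up for illustration, and resources/executor info are omitted for brevity.

    import mesos_pb2

    def make_task(offer):
        task = mesos_pb2.TaskInfo()
        task.task_id.value = "my-task-1"  # illustrative id
        task.name = "my task"             # illustrative name
        # Copy the SlaveID from the offer being accepted; a stale or
        # mismatched slave id here produces "Task uses invalid slave".
        task.slave_id.value = offer.slave_id.value
        # (resources and executor/command info omitted for brevity)
        # str() of a protobuf message is its text/debug form; printing the
        # task before launchTasks makes a slave id mismatch easy to spot.
        print str(task)
        return task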
> > > > > > > > > > I'm using the process-based isolation at the moment (I
> > > > > > > > > > haven't had the time to set up the cgroups isolation
> > > > > > > > > > yet).
> > > > > > > > > >
> > > > > > > > > > I can find and share whatever else is needed so that we
> > > > > > > > > > can figure out why these messages are occurring.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi David,
> > > > > > > > > > >
> > > > > > > > > > > You are right in that all these status updates are
> > > > > > > > > > > what we call "terminal" status updates, and mesos
> > > > > > > > > > > takes specific actions when it gets/generates one of
> > > > > > > > > > > these.
> > > > > > > > > > >
> > > > > > > > > > > TASK_LOST is special in the sense that it is not
> > > > > > > > > > > generated by the executor, but by the slave/master.
> > > > > > > > > > > You could think of it as an exception in mesos.
> > > > > > > > > > > Clearly, these should be rare in a stable mesos
> > > > > > > > > > > system.
> > > > > > > > > > >
> > > > > > > > > > > What do your logs say about the TASK_LOSTs? Is it
> > > > > > > > > > > always the same issue? Are you running w/ cgroups?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello! Today I began working on a more advanced
> > > > > > > > > > > > version of mesos-submit that will handle hot-spares.
> > > > > > > > > > > >
> > > > > > > > > > > > I was assuming that
> > > > > > > > > > > > TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > > > > > > > > > > > updates that meant I needed to start a new spare
> > > > > > > > > > > > process, as the monitored task was killed. However,
> > > > > > > > > > > > I noticed that I often received TASK_LOSTs, and
> > > > > > > > > > > > every 5 seconds my scheduler would think its tasks
> > > > > > > > > > > > had all died, so it'd restart too many.
> > > > > > > > > > > > Nevertheless, the tasks would reappear later on, and
> > > > > > > > > > > > I could see them in the web interface of Mesos,
> > > > > > > > > > > > continuing to run.
> > > > > > > > > > > >
> > > > > > > > > > > > What is going on?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > David
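And for the hot-spare logic in the original question at the bottom of the thread: the scheduler side of treating those four states as terminal might look roughly like this sketch (restart_spare is a placeholder for whatever respawning you do). Keep in mind, as this thread shows, that a TASK_LOST caused by task validation failures or a cleanly exiting executor will trigger it too.

    import mesos
    import mesos_pb2

    TERMINAL_STATES = frozenset([
        mesos_pb2.TASK_FINISHED,
        mesos_pb2.TASK_FAILED,
        mesos_pb2.TASK_KILLED,
        mesos_pb2.TASK_LOST,
    ])

    class HotSpareScheduler(mesos.Scheduler):
        def statusUpdate(self, driver, update):
            # A terminal update means the monitored task is gone for good,
            # so start a replacement. (restart_spare is a placeholder.)
            if update.state in TERMINAL_STATES:
                restart_spare(update.task_id.value)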
