Re: Question about TASK_LOST statuses

Benjamin Mahler Fri, 14 Jun 2013 15:31:25 -0700

We do have an acknowledgement that is sent back to the ExecutorDriver but
it currently is not provided through the Executor API. In the future that
could be your signal to safely exit().


For now I would advise sleeping for a few seconds or longer (10 seconds is
what we use at Twitter) after sending the final update and before exiting,
if you want to be really resistant to networking issues.
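
As a rough illustration of that ordering (send the terminal update, wait,
then exit), here is a minimal Python sketch. It is not the Mesos API:
`finish_task_then_exit`, `send_update`, `grace_seconds`, `sleep`, and
`exit_fn` are hypothetical names used only for illustration, and
`send_update` stands in for whatever your executor uses to submit a status
update through the ExecutorDriver.

```python
import sys
import time

# Terminal task states in Mesos: once a task reports one of these,
# it will not run again.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def finish_task_then_exit(send_update, task_id, state="TASK_FINISHED",
                          grace_seconds=10.0, sleep=time.sleep,
                          exit_fn=sys.exit):
    """Send a terminal update, wait a grace period, then exit the executor.

    send_update is a hypothetical callable (task_id, state) -> None that
    wraps the real status-update call. sleep and exit_fn are injectable
    so the ordering can be checked without really sleeping or exiting.
    """
    if state not in TERMINAL_STATES:
        raise ValueError("not a terminal task state: %r" % (state,))

    # 1. Report the final state of the task.
    send_update(task_id, state)

    # 2. Give the update time to reach the slave. The executor gets no
    #    acknowledgement here, so this delay is a guess, not a guarantee.
    sleep(grace_seconds)

    # 3. Only now shut the executor down.
    exit_fn(0)
```

The 10 second default mirrors the value mentioned above. A longer delay
trades slower executor shutdown for more tolerance of network hiccups.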

Hope that gets things sorted out, let me know how it goes!


On Fri, Jun 14, 2013 at 3:16 PM, David Greenberg <[email protected]> wrote:

> I do send terminal updates for the task:
> https://github.com/dgrnbrg/easypaas/blob/master/src/easypaas/core.clj#L126
>
> The linked-to line spawns a new thread that waits for the underlying
> process to finish, then submits the final task update and exits the
> executor.
>
> On Thursday, June 13, 2013, Benjamin Mahler wrote:
>
> > OK, I'll try to do one thing at a time here. The first thing I'm seeing is
> > that you have an executor terminating.
> >
> > I0611 20:19:58.519618 48373 process_based_isolation_module.cpp:344] Telling
> > slave of lost executor cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
> > 201305261216-3261142444-5050-56457-0006
> >
> > This is fine. We've actually changed this message since 0.12.0 to say
> > "terminated" as opposed to "lost".
> >
> > However, this executor was running tasks! As a result, the slave considers
> > these tasks as lost, and sends the appropriate status updates for them:
> >
> > I0611 20:19:58.519785 48401 slave.cpp:1065] Executor
> > 'cc54e5a4-ca40-444b-9286-72212bf012b5' of framework
> > 201305261216-3261142444-5050-56457-0006 has exited with status 0
> > I0611 20:19:58.525691 48401 slave.cpp:842] Status update: task
> > cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
> > 201305261216-3261142444-5050-56457-0006 is now in state TASK_LOST
> >
> > Since I see an exit status of 0, I'm assuming this is a clean shutdown of a
> > custom executor that you've written? If so, you'll need to send terminal
> > updates for the tasks you're running prior to shutting down the executor.
> > E.g. TASK_FINISHED. Otherwise, the slave will consider all tasks running on
> > the executor as LOST. Does that clear anything up?
> >
> >
> > On Wed, Jun 12, 2013 at 4:39 PM, David Greenberg <[email protected]> wrote:
> >
> > > Sure, sorry I didn't post the link--I'm on a restricted network at work
> > > that blocks uploading sites. Here it is:
> > >
> > > https://www.dropbox.com/s/bhapvvq6kznlgyz/master_and_slave_logs.tar.bz2
> > >
> > > Currently, I'm trying to set up Hadoop and Spark on Mesos for ad-hoc data
> > > analysis tasks. I also wrote a Clojure fluent library for working with
> > > Mesos, which I intend to use to build a new scheduler for a specific
> > > problem at work on our 700 machine cluster. Some of the Clojure work will
> > > be open source (EPL) once I've written better documentation and actually
> > > had an opportunity to test it.
> > >
> > > Thanks!
> > >
> > >
> > > On Wed, Jun 12, 2013 at 6:28 PM, Benjamin Mahler
> > > <[email protected]> wrote:
> > >
> > > > Can you link to the logs?
> > > >
> > > > Can you give us a little background about how you're using mesos? If you're
> > > > using it for production jobs, I would recommend 0.12.0 once released as it
> > > > has been vetted in production (at Twitter at least). We've also included
> > > > instructions on how to upgrade from 0.11.0 to 0.12.0 on a running cluster.
> > > >
> > > >
> > > > On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg <[email protected]> wrote:
> > > >
> > > > > I am on 0.12 right now, git revision
> > > > > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
> > > > > recommends).
> > > > >
> > > > > The master and slave logs are 1.7MB bz2'ed, but apache.org's mailer
> > > > > doesn't accept such large messages. I've sent them directly to Vinod, and I
> > > > > can send them to anyone else who asks.
> > > > >
> > > > > I'm just running mesos w/ --conf, and the config is
> > > > >
> > > > > master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > > > zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> > > > > log_dir = /data/scratch/local/mesos/logs
> > > > > work_dir = /data/scratch/local/mesos/work
> > > > >
> > > > >
> > > > > I would be happy to move to the latest version that's likely stable, but
> > > > > even after reading all of the discussion over the past couple weeks on
> > > > > 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
> > > > > HEAD, or some other commit.
> > > > >
> > > > > Thank you!
> > > > >
> > > > >
> > > > > On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]> wrote:
> > > > >
> > > > > > I am on 0.12 right now, git revision
> > > > > > 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
> > > > > > recommends).
> > > > > >
> > > > > > I've attached the master and slave logs. I'm just running mesos w/ --conf,
> > > > > > and the config is
> > > > > >
> > > > > > master = zk://iadv1.pit.mycompany.com:2181,
> > > > iadv2.pit.mycompany.com:21 <http://iadv2.pit.mycompany.com:2181>
>
