The update is sent from a different thread--the thread is only created/started in the launchTask callback. I could add a short sleep between sending the task update and the System.exit() call and see if that helps. I could also exit asynchronously--what event would I be waiting for?
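
For illustration, here is a minimal sketch of the sleep-before-exit idea, next to the failure mode of exiting while a driver callback is still running. It does not use the Mesos API: FakeDriver, invokeCallback, deliveredSoFar, and the string-valued sendStatusUpdate are hypothetical stand-ins that model a driver handling callbacks and outgoing updates on a single thread. The "exit" is simulated by recording which updates had been delivered at that moment rather than calling System.exit(), and the 500 ms grace period is an arbitrary assumption, not a recommended value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TerminalUpdateSketch {

    /** Hypothetical stand-in for a driver with one internal thread. NOT the Mesos API. */
    static final class FakeDriver {
        private final ExecutorService loop = Executors.newSingleThreadExecutor();
        private final List<String> delivered = new CopyOnWriteArrayList<>();

        /** Runs a callback on the driver thread; nothing else is processed meanwhile. */
        void invokeCallback(Runnable callback) { loop.execute(callback); }

        /** Queues an update; it is delivered only once the driver thread is free. */
        void sendStatusUpdate(String state) { loop.execute(() -> delivered.add(state)); }

        /** Snapshot of the updates that have actually been delivered so far. */
        List<String> deliveredSoFar() { return new ArrayList<>(delivered); }

        void shutdown() { loop.shutdownNow(); }
    }

    /** Problem case: send the update and "exit" while still inside the callback. */
    static List<String> exitInsideCallback() {
        FakeDriver driver = new FakeDriver();
        CompletableFuture<List<String>> atExit = new CompletableFuture<>();
        driver.invokeCallback(() -> {
            driver.sendStatusUpdate("TASK_FINISHED");
            // Stand-in for System.exit(): the update is still queued behind this callback.
            atExit.complete(driver.deliveredSoFar());
        });
        List<String> result = atExit.join();
        driver.shutdown();
        return result;
    }

    /** Workaround: the callback returns at once; a worker sends, waits, then "exits". */
    static List<String> exitFromWorkerAfterGrace(long graceMillis) {
        FakeDriver driver = new FakeDriver();
        CompletableFuture<List<String>> atExit = new CompletableFuture<>();
        driver.invokeCallback(() -> {
            Thread worker = new Thread(() -> {
                try {
                    driver.sendStatusUpdate("TASK_FINISHED");
                    Thread.sleep(graceMillis); // give the driver time to process the update
                    atExit.complete(driver.deliveredSoFar());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    atExit.completeExceptionally(e);
                }
            });
            worker.start();
        });
        List<String> result = atExit.join();
        driver.shutdown();
        return result;
    }

    public static void main(String[] args) {
        System.out.println("exit inside callback: " + exitInsideCallback());
        System.out.println("exit after grace:     " + exitFromWorkerAfterGrace(500));
    }
}
```

In this model the first scenario never sees the update delivered, because the single driver thread is still busy with the callback at the moment of "exit". The second relies purely on timing, so it is a hack in the same sense as the sleep described above: the sketch has no event that signals the update was handed off.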
This helps a lot--I'll try it when I get back to the cluster on Monday. Thanks!

On Jun 14, 2013, at 6:07 PM, Benjamin Mahler <[email protected]> wrote:

> Unfortunately for the moment with the way the ExecutorDriver is designed,
> you cannot send an update within one of the driver callbacks and then
> exit(). This is because the driver is currently blocked on executing the
> callback and cannot process the update until the callback completes.
>
> Is that what's happening in your code?
>
> Are you able to asynchronously exit the program? We also use a hack here at
> Twitter, where we wait a few seconds after sending the final update, prior
> to exiting the executor, to ensure that the update goes through.
>
> These kinds of API annoyances will be fixed in a future v2 API, but for now
> you'll have to live with the quirks. Does that help?
>
>
> On Fri, Jun 14, 2013 at 2:57 PM, David Greenberg
> <[email protected]> wrote:
>
>> I do send terminal updates for the task:
>> https://github.com/dgrnbrg/easypaas/blob/master/src/easypaas/core.clj#L126
>>
>> The linked-to line spawns a new thread that waits for the underlying
>> process to finish, then submits the final task update and exits the
>> executor.
>>
>> On Thursday, June 13, 2013, Benjamin Mahler wrote:
>>
>>> Ok I'll try to do one thing at a time here, the first thing I'm seeing is
>>> that you have an executor terminating.
>>>
>>> I0611 20:19:58.519618 48373 process_based_isolation_module.cpp:344] Telling
>>> slave of lost executor cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
>>> 201305261216-3261142444-5050-56457-0006
>>>
>>> This is fine. We've actually changed this message since 0.12.0 to say
>>> "terminated" as opposed to "lost".
>>>
>>> However, this executor was running tasks! As a result, the slave considers
>>> these tasks as lost, and sends the appropriate status updates for them:
>>>
>>> I0611 20:19:58.519785 48401 slave.cpp:1065] Executor
>>> 'cc54e5a4-ca40-444b-9286-72212bf012b5' of framework
>>> 201305261216-3261142444-5050-56457-0006 has exited with status 0
>>> I0611 20:19:58.525691 48401 slave.cpp:842] Status update: task
>>> cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
>>> 201305261216-3261142444-5050-56457-0006 is now in state TASK_LOST
>>>
>>> Since I see an exit status of 0, I'm assuming this is a clean shutdown of a
>>> custom executor that you've written? If so, you'll need to send terminal
>>> updates for the tasks you're running prior to shutting down the executor.
>>> E.g. TASK_FINISHED. Otherwise, the slave will consider all tasks running on
>>> the executor as LOST. Does that clear anything up?
>>>
>>>
>>> On Wed, Jun 12, 2013 at 4:39 PM, David Greenberg <[email protected]>
>>> wrote:
>>>
>>>> Sure, sorry I didn't post the link--I'm on a restricted network at work
>>>> that blocks uploading sites. Here it is:
>>>> https://www.dropbox.com/s/bhapvvq6kznlgyz/master_and_slave_logs.tar.bz2
>>>>
>>>> Currently, I'm trying to set up Hadoop and Spark on Mesos for ad-hoc data
>>>> analysis tasks. I also wrote a Clojure fluent library for working with
>>>> Mesos, which I intend to use to build a new scheduler for a specific
>>>> problem at work on our 700 machine cluster. Some of the Clojure work will
>>>> be open source (EPL) once I've written better documentation and actually
>>>> had an opportunity to test it.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Wed, Jun 12, 2013 at 6:28 PM, Benjamin Mahler
>>>> <[email protected]> wrote:
>>>>
>>>>> Can you link to the logs?
>>>>>
>>>>> Can you give us a little background about how you're using mesos? If you're
>>>>> using it for production jobs, I would recommend 0.12.0 once released as it
>>>>> has been vetted in production (at Twitter at least). We've also included
>>>>> instructions on how to upgrade from 0.11.0 to 0.12.0 on a running cluster.
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I am on 0.12 right now, git revision
>>>>>> 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
>>>>>> recommends).
>>>>>>
>>>>>> The master and slave logs are 1.7MB bz2'ed, but apache.org's mailer
>>>>>> doesn't accept such large messages. I've sent them directly to Vinod, and I
>>>>>> can send them to anyone else who asks.
>>>>>>
>>>>>> I'm just running mesos w/ --conf, and the config is
>>>>>>
>>>>>> master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
>>>>>> zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
>>>>>> log_dir = /data/scratch/local/mesos/logs
>>>>>> work_dir = /data/scratch/local/mesos/work
>>>>>>
>>>>>>
>>>>>> I would be happy to move to the latest version that's likely stable, but
>>>>>> even after reading all of the discussion over the past couple weeks on
>>>>>> 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
>>>>>> HEAD, or some other commit.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I am on 0.12 right now, git revision
>>>>>>> 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
>>>>>>> recommends).
>>>>>>>
>>>>>>> I've attached the master and slave logs. I'm just running mesos w/ --conf,
>>>>>>> and the config is
>>>>>>>
>>>>>>> master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:21 <http://iadv2.pit.mycompany.com:2181>
