Re: Question about TASK_LOST statuses

David Greenberg Fri, 14 Jun 2013 15:17:26 -0700

The update is sent from a different thread--the thread is only
created/started in the launchTask callback. I could add a short sleep
between the task send and System.exit() call and see if that helps. I
could also asynchronously exit--what event would I be waiting for?


This helps a lot--I'll try it when I get back to the cluster on Monday. Thanks!
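
The workaround discussed in this thread (send the terminal update from a thread other than the driver callback, wait briefly so the driver can deliver it, then exit) can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the Mesos API: `StatusSender`, `reportThenExit`, and the linger duration are hypothetical names and values chosen for illustration. In a real executor the sender would presumably wrap the driver's status-update call and the exit action would call System.exit(). The state strings mirror the Mesos task states TASK_FINISHED and TASK_FAILED.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class DelayedExit {
    /** Hypothetical stand-in for the driver's status-update call. */
    interface StatusSender {
        void send(String taskId, String state);
    }

    /**
     * Starts a separate thread that waits for the task, sends the terminal
     * update, lingers for lingerMillis, and only then runs the exit action.
     * Meant to be called from the launch callback, which returns immediately
     * so the driver is free to process the update.
     */
    static Thread reportThenExit(Callable<Integer> waitForTask, String taskId,
                                 StatusSender sender, long lingerMillis,
                                 Runnable exit) {
        Thread waiter = new Thread(() -> {
            String state;
            try {
                // An exit code of 0 is treated as success (an assumption).
                state = waitForTask.call() == 0 ? "TASK_FINISHED" : "TASK_FAILED";
            } catch (Exception e) {
                state = "TASK_FAILED";
            }
            sender.send(taskId, state);
            try {
                // The "wait a few seconds" hack: give the update time to go out.
                TimeUnit.MILLISECONDS.sleep(lingerMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            exit.run();
        }, "task-waiter");
        waiter.start();
        return waiter;
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo with a fake task that "exits" with code 0. A real executor
        // would pass () -> System.exit(0) as the exit action.
        Thread waiter = reportThenExit(
            () -> 0,
            "task-1",
            (id, state) -> System.out.println(id + " -> " + state),
            2000,
            () -> System.out.println("exiting"));
        waiter.join();
    }
}
```

The sleep is a heuristic rather than a guarantee: nothing in this sketch confirms that the update was actually delivered before the exit action runs.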

Sent from my iPhone

On Jun 14, 2013, at 6:07 PM, Benjamin Mahler <[email protected]> wrote:

> Unfortunately for the moment with the way the ExecutorDriver is designed,
> you cannot send an update within one of the driver callbacks and then
> exit(). This is because the driver is currently blocked on executing the
> callback and cannot process the update until the callback completes.
>
> Is that what's happening in your code?
>
> Are you able to asynchronously exit the program? We also use a hack here at
> Twitter, where we wait a few seconds after sending the final update, prior
> to exiting the executor, to ensure that the update goes through.
>
> These kinds of API annoyances will be fixed in a future v2 API, but for now
> you'll have to live with the quirks. Does that help?
>
>
> On Fri, Jun 14, 2013 at 2:57 PM, David Greenberg
> <[email protected]> wrote:
>
>> I do send terminal updates for the task:
>> https://github.com/dgrnbrg/easypaas/blob/master/src/easypaas/core.clj#L126
>>
>> The linked-to line spawns a new thread that waits for the underlying
>> process to finish, then submits the final task update and exits the
>> executor.
>>
>> On Thursday, June 13, 2013, Benjamin Mahler wrote:
>>
>>> OK, I'll try to do one thing at a time here. The first thing I'm seeing is
>>> that you have an executor terminating.
>>>
>>> I0611 20:19:58.519618 48373 process_based_isolation_module.cpp:344] Telling
>>> slave of lost executor cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
>>> 201305261216-3261142444-5050-56457-0006
>>>
>>> This is fine. We've actually changed this message since 0.12.0 to say
>>> "terminated" as opposed to "lost".
>>>
>>> However, this executor was running tasks! As a result, the slave considers
>>> these tasks as lost, and sends the appropriate status updates for them:
>>>
>>> I0611 20:19:58.519785 48401 slave.cpp:1065] Executor
>>> 'cc54e5a4-ca40-444b-9286-72212bf012b5' of framework
>>> 201305261216-3261142444-5050-56457-0006 has exited with status 0
>>> I0611 20:19:58.525691 48401 slave.cpp:842] Status update: task
>>> cc54e5a4-ca40-444b-9286-72212bf012b5 of framework
>>> 201305261216-3261142444-5050-56457-0006 is now in state TASK_LOST
>>>
>>> Since I see an exit status of 0, I'm assuming this is a clean shutdown of a
>>> custom executor that you've written? If so, you'll need to send terminal
>>> updates for the tasks you're running prior to shutting down the executor.
>>> E.g. TASK_FINISHED. Otherwise, the slave will consider all tasks running on
>>> the executor as LOST. Does that clear anything up?
>>>
>>>
>>> On Wed, Jun 12, 2013 at 4:39 PM, David Greenberg <[email protected]>
>>> wrote:
>>>
>>>> Sure, sorry I didn't post the link--I'm on a restricted network at work
>>>> that blocks uploading sites. Here it is:
>>>> https://www.dropbox.com/s/bhapvvq6kznlgyz/master_and_slave_logs.tar.bz2
>>>>
>>>> Currently, I'm trying to set up Hadoop and Spark on Mesos for ad-hoc data
>>>> analysis tasks. I also wrote a Clojure fluent library for working with
>>>> Mesos, which I intend to use to build a new scheduler for a specific
>>>> problem at work on our 700 machine cluster. Some of the Clojure work will
>>>> be open source (EPL) once I've written better documentation and actually
>>>> had an opportunity to test it.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Wed, Jun 12, 2013 at 6:28 PM, Benjamin Mahler
>>>> <[email protected]> wrote:
>>>>
>>>>> Can you link to the logs?
>>>>>
>>>>> Can you give us a little background about how you're using mesos? If you're
>>>>> using it for production jobs, I would recommend 0.12.0 once released as it
>>>>> has been vetted in production (at Twitter at least). We've also included
>>>>> instructions on how to upgrade from 0.11.0 to 0.12.0 on a running cluster.
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:07 AM, David Greenberg <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I am on 0.12 right now, git revision
>>>>>> 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
>>>>>> recommends).
>>>>>>
>>>>>> The master and slave logs are 1.7MB bz2'ed, but apache.org's mailer
>>>>>> doesn't accept such large messages. I've sent them directly to Vinod, and I
>>>>>> can send them to anyone else who asks.
>>>>>>
>>>>>> I'm just running mesos w/ --conf, and the config is
>>>>>>
>>>>>> master = zk://iadv1.pit.mycompany.com:2181,
>>>>>> iadv2.pit.mycompany.com:2181,
>>>>>> iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
>>>>>> iadv5.pit.mycompany.com:2181/mesos
>>>>>> zk = zk://iadv1.pit.mycompany.com:2181,
>>>>>> iadv2.pit.mycompany.com:2181,
>>>>>> iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,
>>>>>> iadv5.pit.mycompany.com:2181/mesos
>>>>>> log_dir = /data/scratch/local/mesos/logs
>>>>>> work_dir = /data/scratch/local/mesos/work
>>>>>>
>>>>>>
>>>>>> I would be happy to move to the latest version that's likely stable, but
>>>>>> even after reading all of the discussion over the past couple weeks on
>>>>>> 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those,
>>>>>> HEAD, or some other commit.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I am on 0.12 right now, git revision
>>>>>>> 3758114ee4492dcbb784d5aac65d43ac54ddb439 (same as airbnb/chronos
>>>>>>> recommends).
>>>>>>>
>>>>>>> I've attached the master and slave logs. I'm just running mesos w/ --conf,
>>>>>>> and the config is
>>>>>>>
>>>>>>> master = zk://iadv1.pit.mycompany.com:2181,
>>>>> iadv2.pit.mycompany.com:21 <http://iadv2.pit.mycompany.com:2181>
>>
