Hey Vinod and other Mesos devs, I was wondering if the information below is at all useful for understanding why so many TASK_LOST messages are occurring in my Mesos cluster?
Thanks!
David

On Saturday, May 18, 2013, David Greenberg wrote:

> I am looking at the slave's logs, and here's what I see:
> - 81 instances of "Telling slave of lost executor XXX of framework YYY"
> - 500,000+ instances of "Failed to collect resource usage for executor XXX of framework YYY"
> - 8 instances of "WARNING! executor XXX of framework YYY should be shutting down"
>
> On the master's logs, I see this:
> - 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"
>
> What do you think the problem is? I am copying the slave_id from the offer into the TaskInfo protobuf.
>
> I'm using the process-based isolation at the moment (I haven't had the time to set up the cgroups isolation yet).
>
> I can find and share whatever else is needed so that we can figure out why these messages are occurring.
>
> Thanks,
> David
>
>
> On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
>
>> Hi David,
>>
>> You are right in that all of these status updates are what we call "terminal" status updates, and Mesos takes specific actions when it gets/generates one of them.
>>
>> TASK_LOST is special in the sense that it is not generated by the executor, but by the slave/master. You could think of it as an exception in Mesos. Clearly, these should be rare in a stable Mesos system.
>>
>> What do your logs say about the TASK_LOSTs? Is it always the same issue? Are you running with cgroups?
>>
>>
>> On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
>>
>>> Hello! Today I began working on a more advanced version of mesos-submit that will handle hot spares.
>>>
>>> I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status updates that meant I needed to start a new spare process, since the monitored task had been killed. However, I noticed that I often received TASK_LOSTs, and every 5 seconds my scheduler would think all its tasks had died, so it would restart too many spares. Nevertheless, the tasks would reappear later on, and I could see them in the Mesos web interface, continuing to run.
>>>
>>> What is going on?
>>>
>>> Thanks!
>>> David
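
For reference, here is roughly what the relevant parts of my scheduler look like. This is a minimal sketch against the Mesos 0.x Python bindings for illustration only: the task name, ID, resources, and command below are placeholders, not my real framework's values.

    # Minimal sketch, assuming the Mesos 0.x Python bindings
    # (mesos / mesos_pb2). Placeholder task details throughout.
    import mesos
    import mesos_pb2

    TERMINAL_STATES = frozenset([
        mesos_pb2.TASK_FAILED,
        mesos_pb2.TASK_FINISHED,
        mesos_pb2.TASK_LOST,
        mesos_pb2.TASK_KILLED,
    ])

    class HotSpareScheduler(mesos.Scheduler):
        def resourceOffers(self, driver, offers):
            for offer in offers:
                task = mesos_pb2.TaskInfo()
                task.name = "hot-spare"          # placeholder name
                task.task_id.value = "spare-1"   # placeholder ID
                # Copy the slave_id from the offer into the TaskInfo,
                # as discussed above; the master rejects tasks whose
                # slave_id doesn't match a known slave ("Task uses
                # invalid slave").
                task.slave_id.value = offer.slave_id.value

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1.0

                task.command.value = "sleep 1000"  # placeholder command
                driver.launchTasks(offer.id, [task])

        def statusUpdate(self, driver, update):
            # Terminal updates mean the task is gone, so each one is
            # treated as a signal to start a replacement spare.
            if update.state in TERMINAL_STATES:
                print("task %s reached a terminal state" % update.task_id.value)
                # ... start a replacement spare here ...

The statusUpdate handler is where the spurious TASK_LOSTs bite: each one looks like a dead task, so the scheduler launches another spare even though the original task is still running.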
