Hey Vinod and other Mesos devs, I was wondering if the information below is at all useful for understanding why so many TASK_LOST messages are occurring in my Mesos cluster?
Thanks!
David

On Saturday, May 18, 2013, David Greenberg wrote:

> I am looking at the slave's logs, and here's what I see:
> - 81 instances of "Telling slave of lost executor XXX of framework YYY"
> - 500,000+ instances of "Failed to collect resource usage for executor XXX of framework YYY"
> - 8 instances of "WARNING! executor XXX of framework YYY should be shutting down"
>
> On the master's logs, I see this:
> - 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"
>
> What do you think the problem is? I am copying the slave_id from the offer into the TaskInfo protobuf.
>
> I'm using the process-based isolation at the moment (I haven't had the time to set up the cgroups isolation yet).
>
> I can find and share whatever else is needed so that we can figure out why these messages are occurring.
>
> Thanks,
> David
>
>
> On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
>
>> Hi David,
>>
>> You are right in that all of these status updates are what we call "terminal" status updates, and Mesos takes specific actions when it gets/generates one of them.
>>
>> TASK_LOST is special in the sense that it is not generated by the executor, but by the slave/master. You could think of it as an exception in Mesos. Clearly, these should be rare in a stable Mesos system.
>>
>> What do your logs say about the TASK_LOSTs? Is it always the same issue? Are you running with cgroups?
>>
>>
>> On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
>>
>>> Hello! Today I began working on a more advanced version of mesos-submit that will handle hot spares.
>>>
>>> I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status updates that meant I needed to start a new spare process, since the monitored task had been killed. However, I noticed that I often received TASK_LOSTs, and every 5 seconds my scheduler would think all its tasks had died, so it would restart too many spares. Nevertheless, the tasks would reappear later on, and I could see them in the Mesos web interface, continuing to run.
>>>
>>> What is going on?
>>>
>>> Thanks!
>>> David
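
For reference, here is roughly what the relevant parts of my scheduler look like. This is a minimal sketch against the Mesos 0.x Python bindings for illustration only: the task name, ID, resources, and command below are placeholders, not my real framework's values.

    # Minimal sketch, assuming the Mesos 0.x Python bindings
    # (mesos / mesos_pb2). Placeholder task details throughout.
    import mesos
    import mesos_pb2

    TERMINAL_STATES = frozenset([
        mesos_pb2.TASK_FAILED,
        mesos_pb2.TASK_FINISHED,
        mesos_pb2.TASK_LOST,
        mesos_pb2.TASK_KILLED,
    ])

    class HotSpareScheduler(mesos.Scheduler):
        def resourceOffers(self, driver, offers):
            for offer in offers:
                task = mesos_pb2.TaskInfo()
                task.name = "hot-spare"          # placeholder name
                task.task_id.value = "spare-1"   # placeholder ID
                # Copy the slave_id from the offer into the TaskInfo,
                # as discussed above; the master rejects tasks whose
                # slave_id doesn't match a known slave ("Task uses
                # invalid slave").
                task.slave_id.value = offer.slave_id.value

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1.0

                task.command.value = "sleep 1000"  # placeholder command
                driver.launchTasks(offer.id, [task])

        def statusUpdate(self, driver, update):
            # Terminal updates mean the task is gone, so each one is
            # treated as a signal to start a replacement spare.
            if update.state in TERMINAL_STATES:
                print("task %s reached a terminal state" % update.task_id.value)
                # ... start a replacement spare here ...

The statusUpdate handler is where the spurious TASK_LOSTs bite: each one looks like a dead task, so the scheduler launches another spare even though the original task is still running.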
