I am looking at the slave's logs, and here's what I see:

- 81 instances of "Telling slave of lost executor XXX of framework YYY"
- 500,000+ instances of "Failed to collect resource usage for executor XXX of framework YYY"
- 8 instances of "WARNING! executor XXX of framework YYY should be shutting down"
On the master's logs, I see this:

- 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"

What do you think the problem is? I am copying the slave_id from the offer into the TaskInfo protobuf (simplified snippets of the relevant scheduler code are at the bottom of this mail). I'm using process-based isolation at the moment (I haven't had time to set up cgroups isolation yet).

I can find and share whatever else is needed so that we can figure out why these messages are occurring.

Thanks,
David

On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:

> Hi David,
>
> You are right in that all these status updates are what we call "terminal"
> status updates, and Mesos takes specific actions when it gets/generates one
> of these.
>
> TASK_LOST is special in the sense that it is not generated by the executor,
> but by the slave/master. You could think of it as an exception in Mesos.
> Clearly, these should be rare in a stable Mesos system.
>
> What do your logs say about the TASK_LOSTs? Is it always the same issue?
> Are you running w/ cgroups?
>
>
> On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
>
> > Hello! Today I began working on a more advanced version of mesos-submit
> > that will handle hot spares.
> >
> > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > updates that meant I needed to start a new spare process, as the
> > monitored task was killed. However, I noticed that I often received
> > TASK_LOSTs, and every 5 seconds my scheduler would think its tasks had
> > all died, so it'd restart too many. Nevertheless, the tasks would
> > reappear later on, and I could see them in the web interface of Mesos,
> > continuing to run.
> >
> > What is going on?
> >
> > Thanks!
> > David
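For reference, here is roughly how I build the TaskInfo from the offer. This is heavily simplified (the class name, task name, task id, and command are placeholders, and it assumes the Python bindings that ship with Mesos); the real code is longer, but this is where the slave_id gets copied:

import mesos       # Python bindings that ship with Mesos
import mesos_pb2

class SpareScheduler(mesos.Scheduler):  # class name is a placeholder
    def resourceOffers(self, driver, offers):
        for offer in offers:
            task = mesos_pb2.TaskInfo()
            task.name = "spare"             # placeholder
            task.task_id.value = "spare-1"  # placeholder; the real code generates unique ids
            # The copy in question: slave_id is taken straight from the offer.
            task.slave_id.MergeFrom(offer.slave_id)
            task.command.value = "sleep 60" # placeholder command
            cpus = task.resources.add()
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 1
            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32
            driver.launchTasks(offer.id, [task])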

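And this is roughly how my scheduler reacts to status updates for the hot-spare logic from my first mail in the thread above. Again simplified: launchSpare is just a stand-in for the real restart code. This is the part that over-fires when the spurious TASK_LOSTs arrive:

import mesos
import mesos_pb2

# The states I treat as "the task is gone, start a replacement".
TERMINAL_STATES = set([mesos_pb2.TASK_FINISHED,
                       mesos_pb2.TASK_FAILED,
                       mesos_pb2.TASK_KILLED,
                       mesos_pb2.TASK_LOST])

class SpareScheduler(mesos.Scheduler):  # same placeholder class as above
    def statusUpdate(self, driver, update):
        if update.state in TERMINAL_STATES:
            # Every terminal update triggers a replacement, so a burst of
            # bogus TASK_LOSTs makes the scheduler launch far too many spares.
            self.launchSpare(driver, update.task_id)

    def launchSpare(self, driver, task_id):
        # Stand-in for the real logic, which reposts the task on the next offer.
        pass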