---------- Forwarded message ----------
From: Vinod Kone <[email protected]>
Date: Sun, May 19, 2013 at 6:56 PM
Subject: Re: Question about TASK_LOST statuses
To: "[email protected]" <[email protected]>

> On the master's logs, I see this:
> - 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"
> What do you think the problem is? I am copying the slave_id from the offer
> into the TaskInfo protobuf.

This will happen if the slave id in the task doesn't match the slave id in the slave. Are you sure you are copying the right slave ids to the right tasks? Looks like there is a mismatch. Maybe some logs/printfs in your scheduler, when you launch tasks, can point out the issue.

> I'm using the process-based isolation at the moment (I haven't had the time
> to set up the cgroups isolation yet).
>
> I can find and share whatever else is needed so that we can figure out why
> these messages are occurring.
>
> Thanks,
> David
>
> On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
>
> > Hi David,
> >
> > You are right in that all these status updates are what we call "terminal"
> > status updates and mesos takes specific actions when it gets/generates one
> > of these.
> >
> > TASK_LOST is special in the sense that it is not generated by the executor,
> > but by the slave/master. You could think of it as an exception in mesos.
> > Clearly, these should be rare in a stable mesos system.
> >
> > What do your logs say about the TASK_LOSTs? Is it always the same issue?
> > Are you running w/ cgroups?
> >
> > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> >
> > > Hello! Today I began working on a more advanced version of mesos-submit
> > > that will handle hot-spares.
> > >
> > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > > updates that meant that I needed to start a new spare process, as the
> > > monitored task was killed. However, I noticed that I often received
> > > TASK_LOSTs, and every 5 seconds, my scheduler would think its tasks had
> > > all died, so it'd restart too many. Nevertheless, the tasks would reappear
> > > later on, and I could see them in the web interface of Mesos, continuing
> > > to run.
> > >
> > > What is going on?
> > >
> > > Thanks!
> > > David
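
As a rough illustration of the logging suggested above, here is a minimal Python sketch of a scheduler-side check. It does not use the Mesos bindings: `Offer`, `Task`, `build_task`, and `mismatched` are hypothetical stand-ins for the real protobufs and launch code, and the ids are made up. The only point it makes is that each task should carry the slave id of the offer it is launched against, and that printing both at launch time makes a mismatch easy to spot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical stand-in for a resource offer: only the two ids matter here.
    offer_id: str
    slave_id: str


@dataclass(frozen=True)
class Task:
    # Hypothetical stand-in for TaskInfo: a task id plus the slave it targets.
    task_id: str
    slave_id: str


def build_task(task_id, offer):
    """Create a task bound to the slave that made this offer, and log the pairing."""
    task = Task(task_id=task_id, slave_id=offer.slave_id)
    print(f"launching {task.task_id} on slave {task.slave_id} (offer {offer.offer_id})")
    return task


def mismatched(launches):
    """Return ids of tasks whose slave_id differs from their offer's slave_id.

    `launches` is a list of (offer, [tasks]) pairs, i.e. what the scheduler
    is about to launch against each offer.
    """
    return [
        task.task_id
        for offer, tasks in launches
        for task in tasks
        if task.slave_id != offer.slave_id
    ]


if __name__ == "__main__":
    first = Offer(offer_id="offer-1", slave_id="slave-A")
    second = Offer(offer_id="offer-2", slave_id="slave-B")

    good = build_task("task-1", first)
    # A plausible bug: a slave id kept from an earlier offer is reused
    # for a task that is launched against a different offer.
    stale = Task(task_id="task-2", slave_id=first.slave_id)

    print("mismatched tasks:", mismatched([(first, [good]), (second, [stale])]))
```

Running a check like `mismatched` just before launching tasks would flag a slave id carried over from a different offer, which is one way the mismatch described above could arise.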
