So, I got a fresh new Mac and installed Mesos, ZooKeeper, and the other local processes on it.
When I test the code locally on the Mac (which has intermittent network connectivity), it runs fine for several minutes and then crashes OS X (the machine hard-locks), which is strange because I don't observe CPU or memory spikes, or network events (which I'm logging at 1s intervals). When I run the code on the cluster (which is Ubuntu Linux based), I still see a huge number of TASK_LOST messages, and the framework fails to get any tasks to run successfully. What do you think the next steps are to debug this? Could it be a lossy network, or a misconfiguration of the slaves, the master, or ZooKeeper? Thank you!

On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler <[email protected]> wrote:

> Yes, the Python bindings are still supported.
>
> Can you dump the DebugString of the TaskInfo you're constructing, to
> confirm the SlaveID looks ok?
>
> Ben
>
>
> On Tue, May 28, 2013 at 7:06 AM, David Greenberg <[email protected]> wrote:
>
> > Sorry for the delayed response -- I'm having some issues w/ email
> > delivery to gmail...
> >
> > I'm trying to use the Python binding in this application. I am copying
> > from offer.slave_id.value to task.slave_id.value using the = operator.
> >
> > Is the Python binding still supported? Either way, due to some new
> > concurrency requirements, I'm going to be shifting gears into writing
> > a JVM-based Mesos framework now.
> >
> > Thanks!
> >
> >
> > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
> >
> > > ---------- Forwarded message ----------
> > > From: Vinod Kone <[email protected]>
> > > Date: Sun, May 19, 2013 at 6:56 PM
> > > Subject: Re: Question about TASK_LOST statuses
> > > To: "[email protected]" <[email protected]>
> > >
> > >
> > > > On the master's logs, I see this:
> > > >
> > > > - 5600+ instances of "Error validating task XXX: Task uses invalid
> > > > slave: SOME_UUID"
> > > >
> > > > What do you think the problem is? I am copying the slave_id from
> > > > the offer into the TaskInfo protobuf.
> > >
> > > This will happen if the slave id in the task doesn't match the slave
> > > id in the slave. Are you sure you are copying the right slave ids to
> > > the right tasks? Looks like there is a mismatch. Maybe some
> > > logs/printfs in your scheduler, when you launch tasks, can point out
> > > the issue.
> > >
> > > > I'm using the process-based isolation at the moment (I haven't had
> > > > the time to set up the cgroups isolation yet).
> > > >
> > > > I can find and share whatever else is needed so that we can figure
> > > > out why these messages are occurring.
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > You are right in that all these status updates are what we call
> > > > > "terminal" status updates, and Mesos takes specific actions when
> > > > > it gets/generates one of these.
> > > > >
> > > > > TASK_LOST is special in the sense that it is not generated by the
> > > > > executor, but by the slave/master. You could think of it as an
> > > > > exception in Mesos. Clearly, these should be rare in a stable
> > > > > Mesos system.
> > > > >
> > > > > What do your logs say about the TASK_LOSTs? Is it always the same
> > > > > issue? Are you running w/ cgroups?
> > > > >
> > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> > > > >
> > > > > > Hello! Today I began working on a more advanced version of
> > > > > > mesos-submit that will handle hot-spares.
> > > > > >
> > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the
> > > > > > status updates that meant that I needed to start a new spare
> > > > > > process, as the monitored task was killed. However, I noticed
> > > > > > that I often received TASK_LOSTs, and every 5 seconds my
> > > > > > scheduler would think its tasks had all died, so it'd restart
> > > > > > too many. Nevertheless, the tasks would reappear later on, and
> > > > > > I could see them in the web interface of Mesos, continuing to
> > > > > > run.
> > > > > >
> > > > > > What is going on?
> > > > > >
> > > > > > Thanks!
> > > > > > David
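For anyone debugging the same "Task uses invalid slave" errors: below is a minimal sketch of the slave-id copy and launch-time dump that Ben and Vinod suggest above. It assumes the old-style Python bindings (mesos / mesos_pb2); the SpareScheduler class, the sleep command, and the resource values are illustrative, not taken from David's actual framework. Note that Python protobuf messages don't expose a DebugString() method; str(task) produces the equivalent text dump.

    import logging

    import mesos
    import mesos_pb2


    class SpareScheduler(mesos.Scheduler):  # hypothetical scheduler name
        def resourceOffers(self, driver, offers):
            for offer in offers:
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "task-%s" % offer.id.value
                task.name = "hot-spare"
                task.command.value = "sleep 60"  # placeholder workload

                # Copy the whole SlaveID message from the offer being
                # accepted, rather than hand-assigning .value fields; a task
                # whose slave id doesn't match the offer's slave is rejected
                # by the master as "Task uses invalid slave".
                task.slave_id.CopyFrom(offer.slave_id)

                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1

                # The dump Ben asked for: log it next to the offer id so a
                # mismatched slave id is easy to spot in the scheduler logs.
                logging.info("Launching on offer %s:\n%s",
                             offer.id.value, str(task))

                driver.launchTasks(offer.id, [task])

The key line is task.slave_id.CopyFrom(offer.slave_id): taking the whole SlaveID from the same offer the task is launched on makes it much harder to accidentally pair a task with a stale id from an earlier offer.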

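And for the original hot-spare question: a sketch, under the same assumptions, of a statusUpdate handler that treats TASK_{FINISHED,FAILED,KILLED,LOST} as terminal but reacts at most once per task id, so repeated TASK_LOSTs (which, per Vinod, come from the slave/master rather than the executor) don't spawn a fresh spare every five seconds. The restart_spare hook is hypothetical.

    import mesos
    import mesos_pb2

    # Per Vinod: the "terminal" status updates. TASK_LOST is generated by
    # the slave/master rather than the executor, and the same task can be
    # reported lost more than once while it is actually still running.
    TERMINAL_STATES = frozenset([
        mesos_pb2.TASK_FINISHED,
        mesos_pb2.TASK_FAILED,
        mesos_pb2.TASK_KILLED,
        mesos_pb2.TASK_LOST,
    ])


    class SpareScheduler(mesos.Scheduler):
        def __init__(self):
            self.replaced = set()  # task ids we already launched a spare for

        def statusUpdate(self, driver, update):
            if update.state not in TERMINAL_STATES:
                return
            task_id = update.task_id.value
            # Deduplicate: repeated TASK_LOSTs for one task (or LOST
            # followed by FAILED) should trigger exactly one replacement,
            # not one per status message.
            if task_id in self.replaced:
                return
            self.replaced.add(task_id)
            self.restart_spare(task_id)

        def restart_spare(self, task_id):
            # Hypothetical hook: queue a replacement task to launch on the
            # next suitable resource offer.
            pass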