I am on 0.12 right now, git revision 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same revision airbnb/chronos recommends).
The master and slave logs are 1.7 MB bz2'ed, but apache.org's mailer doesn't accept such large messages. I've sent them directly to Vinod, and I can send them to anyone else who asks. I'm just running mesos w/ --conf, and the config is:

master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
log_dir = /data/scratch/local/mesos/logs
work_dir = /data/scratch/local/mesos/work

I would be happy to move to the latest version that's likely stable, but even after reading all of the discussion over the past couple of weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those, HEAD, or some other commit.

Thank you!

On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]> wrote:

> I am on 0.12 right now, git revision 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same revision airbnb/chronos recommends).
>
> I've attached the master and slave logs. I'm just running mesos w/ --conf, and the config is:
>
> master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> log_dir = /data/scratch/local/mesos/logs
> work_dir = /data/scratch/local/mesos/work
>
> I would be happy to move to the latest version that's likely stable, but even after reading all of the discussion over the past couple of weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those, HEAD, or some other commit.
>
> Thank you!
>
> On Tue, Jun 11, 2013 at 4:53 PM, Vinod Kone <[email protected]> wrote:
>
> > What version of mesos are you running? Some logs and command lines would be great to debug here.
> >
> > On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg <[email protected]> wrote:
> >
> > > So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and other local processes on.
> > >
> > > When I test the code locally on the Mac (which has intermittent network connectivity), it runs fine for several minutes, then crashes OS X (the machine hard-locks), which is strange because I don't observe CPU or memory spikes, or network events (which I'm logging at 1s intervals).
> > >
> > > When I run the code on the cluster (which is Ubuntu Linux based), I still see a huge number of TASK_LOST messages, and the framework fails to get any tasks to run successfully.
> > >
> > > What do you think the next steps are to debug this? Could it be a lossy network, or a misconfiguration of the slaves, the master, or zookeeper?
> > >
> > > Thank you!
> > >
> > > On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler <[email protected]> wrote:
> > >
> > > > Yes, the Python bindings are still supported.
> > > >
> > > > Can you dump the DebugString of the TaskInfo you're constructing, to confirm the SlaveID looks ok?
> > > >
> > > > Ben
> > > >
> > > > On Tue, May 28, 2013 at 7:06 AM, David Greenberg <[email protected]> wrote:
> > > >
> > > > > Sorry for the delayed response -- I'm having some issues w/ email delivery to gmail...
> > > > >
> > > > > I'm trying to use the Python binding in this application. I am copying from offer.slave_id.value to task.slave_id.value using the = operator.
> > > > >
> > > > > Is the python binding still supported? Either way, due to some new concurrency requirements, I'm going to be shifting gears into writing a JVM-based Mesos framework now.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
> > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Vinod Kone <[email protected]>
> > > > > > Date: Sun, May 19, 2013 at 6:56 PM
> > > > > > Subject: Re: Question about TASK_LOST statuses
> > > > > > To: "[email protected]" <[email protected]>
> > > > > >
> > > > > > > On the master's logs, I see this:
> > > > > > >
> > > > > > > - 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"
> > > > > > >
> > > > > > > What do you think the problem is? I am copying the slave_id from the offer into the TaskInfo protobuf.
> > > > > >
> > > > > > This will happen if the slave id in the task doesn't match the slave id of the slave. Are you sure you are copying the right slave ids to the right tasks? Looks like there is a mismatch. Maybe some logs/printfs on your scheduler, when you launch tasks, can point out the issue.
> > > > > >
> > > > > > > I'm using the process-based isolation at the moment (I haven't had the time to set up the cgroups isolation yet).
> > > > > > >
> > > > > > > I can find and share whatever else is needed so that we can figure out why these messages are occurring.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > David
> > > > > > >
> > > > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > You are right in that all of these status updates are what we call "terminal" status updates, and mesos takes specific actions when it gets/generates one of them.
> > > > > > > >
> > > > > > > > TASK_LOST is special in the sense that it is not generated by the executor, but by the slave/master. You could think of it as an exception in mesos. Clearly, these should be rare in a stable mesos system.
> > > > > > > >
> > > > > > > > What do your logs say about the TASK_LOSTs? Is it always the same issue? Are you running w/ cgroups?
> > > > > > > >
> > > > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello! Today I began working on a more advanced version of mesos-submit that will handle hot-spares.
> > > > > > > > >
> > > > > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status updates that meant that I needed to start a new spare process, as the monitored task was killed.
> > > > > > > > > However, I noticed that I often received TASK_LOSTs, and every 5 seconds my scheduler would think its tasks had all died, so it'd restart too many. Nevertheless, the tasks would reappear later on, and I could see them in the web interface of Mesos, continuing to run.
> > > > > > > > >
> > > > > > > > > What is going on?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
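
To make Vinod's point about terminal updates concrete, here is a minimal sketch of the hot-spare bookkeeping David describes, written against the 0.12-era Python bindings. It is only illustrative: HotSpareScheduler and its respawn helper are hypothetical names I've made up, and only the TASK_* constants and the statusUpdate callback shape come from the actual bindings (mesos_pb2 / mesos.Scheduler).

import mesos_pb2

# The four states David lists are exactly the "terminal" updates Vinod
# describes; mesos_pb2 exposes them as module-level enum constants.
TERMINAL_STATES = frozenset([
    mesos_pb2.TASK_FINISHED,
    mesos_pb2.TASK_FAILED,
    mesos_pb2.TASK_KILLED,
    mesos_pb2.TASK_LOST,
])

class HotSpareScheduler(object):
    """Hypothetical sketch; a real framework would subclass mesos.Scheduler."""

    def __init__(self):
        self.live_tasks = set()  # task ids we believe are running

    def statusUpdate(self, driver, update):
        task_id = update.task_id.value
        if update.state not in TERMINAL_STATES:
            return
        # TASK_LOST is raised by the slave/master rather than the executor,
        # so on a flaky network it can arrive for a task that is actually
        # still running. Dropping the id from our books before respawning
        # makes duplicate LOSTs for the same task a no-op, which avoids the
        # "restart too many" stampede described in the thread above.
        if task_id in self.live_tasks:
            self.live_tasks.discard(task_id)
            self.respawn(task_id)

    def respawn(self, task_id):
        # Placeholder: the real framework would remember the task's spec
        # and relaunch it from the next suitable resource offer.
        print("respawning a spare for %s" % task_id)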

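Similarly, on the "Task uses invalid slave" error: the sketch below shows one safe way to build a TaskInfo from an offer with the Python bindings, plus the logging Vinod and Ben suggested. Field names follow my reading of the 0.12-era mesos.proto, the make_task helper and its resource amounts are invented for illustration, and str() on a message is the Python stand-in for protobuf's C++ DebugString().

import mesos_pb2

def make_task(offer, task_id_str, shell_command):
    """Hypothetical helper: build a TaskInfo bound to the offer's slave."""
    task = mesos_pb2.TaskInfo()
    task.name = "task %s" % task_id_str
    task.task_id.value = task_id_str
    # CopyFrom keeps the whole SlaveID message in lockstep with the offer.
    # Assigning offer.slave_id.value by hand also works, but is easier to
    # mix up when several offers are in flight at once, which is exactly
    # the mismatch that produces "Task uses invalid slave: ...".
    task.slave_id.CopyFrom(offer.slave_id)
    task.command.value = shell_command

    cpus = task.resources.add()
    cpus.name = "cpus"
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.5  # illustrative amount

    # str() renders the message in protobuf text format, the Python
    # equivalent of the DebugString() dump Ben asked for.
    print("launching against offer %s on slave %s:\n%s"
          % (offer.id.value, offer.slave_id.value, str(task)))
    assert task.slave_id.value == offer.slave_id.value
    return task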