I am on 0.12 right now, git revision 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same revision airbnb/chronos recommends).
The master and slave logs are 1.7 MB bz2'ed, but apache.org's mailer doesn't accept such large messages. I've sent them directly to Vinod, and I can send them to anyone else who asks. I'm just running mesos w/ --conf, and the config is:

master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
log_dir = /data/scratch/local/mesos/logs
work_dir = /data/scratch/local/mesos/work

I would be happy to move to the latest version that's likely stable, but even after reading all of the discussion over the past couple of weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those, HEAD, or some other commit.

Thank you!

On Wed, Jun 12, 2013 at 10:01 AM, David Greenberg <[email protected]> wrote:

> I am on 0.12 right now, git revision 3758114ee4492dcbb784d5aac65d43ac54ddb439 (the same revision airbnb/chronos recommends).
>
> I've attached the master and slave logs. I'm just running mesos w/ --conf, and the config is:
>
> master = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> zk = zk://iadv1.pit.mycompany.com:2181,iadv2.pit.mycompany.com:2181,iadv3.pit.mycompany.com:2181,iadv4.pit.mycompany.com:2181,iadv5.pit.mycompany.com:2181/mesos
> log_dir = /data/scratch/local/mesos/logs
> work_dir = /data/scratch/local/mesos/work
>
> I would be happy to move to the latest version that's likely stable, but even after reading all of the discussion over the past couple of weeks on 0.11, 0.12, and 0.13, I have no idea whether I should pick one of those, HEAD, or some other commit.
>
> Thank you!
>
> On Tue, Jun 11, 2013 at 4:53 PM, Vinod Kone <[email protected]> wrote:
>
> > What version of mesos are you running? Some logs and command lines would be great to debug here.
> >
> > On Tue, Jun 11, 2013 at 1:48 PM, David Greenberg <[email protected]> wrote:
> >
> > > So, I got a new, fresh Mac that I installed Mesos, Zookeeper, and other local processes on.
> > >
> > > When I test the code locally on the Mac (which has intermittent network connectivity), it runs fine for several minutes, then crashes OS X (the machine hard-locks), which is strange because I don't observe CPU or memory spikes, or network events (which I'm logging at 1s intervals).
> > >
> > > When I run the code on the cluster (which is Ubuntu Linux based), I still see a huge number of TASK_LOST messages, and the framework fails to get any tasks to run successfully.
> > >
> > > What do you think the next steps are to debug this? Could it be a lossy network, or a misconfiguration of the slaves, the master, or zookeeper?
> > >
> > > Thank you!
> > >
> > > On Mon, Jun 3, 2013 at 4:48 PM, Benjamin Mahler <[email protected]> wrote:
> > >
> > > > Yes, the Python bindings are still supported.
> > > >
> > > > Can you dump the DebugString of the TaskInfo you're constructing, to confirm the SlaveID looks ok?
> > > >
> > > > Ben
> > > >
> > > > On Tue, May 28, 2013 at 7:06 AM, David Greenberg <[email protected]> wrote:
> > > >
> > > > > Sorry for the delayed response -- I'm having some issues w/ email delivery to gmail...
> > > > >
> > > > > I'm trying to use the Python binding in this application. I am copying from offer.slave_id.value to task.slave_id.value using the = operator.
> > > > >
> > > > > Is the python binding still supported? Either way, due to some new concurrency requirements, I'm going to be shifting gears into writing a JVM-based Mesos framework now.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Thu, May 23, 2013 at 1:02 PM, Vinod Kone <[email protected]> wrote:
> > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Vinod Kone <[email protected]>
> > > > > > Date: Sun, May 19, 2013 at 6:56 PM
> > > > > > Subject: Re: Question about TASK_LOST statuses
> > > > > > To: "[email protected]" <[email protected]>
> > > > > >
> > > > > > > On the master's logs, I see this:
> > > > > > >
> > > > > > > - 5600+ instances of "Error validating task XXX: Task uses invalid slave: SOME_UUID"
> > > > > > >
> > > > > > > What do you think the problem is? I am copying the slave_id from the offer into the TaskInfo protobuf.
> > > > > >
> > > > > > This will happen if the slave id in the task doesn't match the slave id of the slave. Are you sure you are copying the right slave ids to the right tasks? Looks like there is a mismatch. Maybe some logs/printfs on your scheduler, when you launch tasks, can point out the issue.
> > > > > >
> > > > > > > I'm using the process-based isolation at the moment (I haven't had the time to set up the cgroups isolation yet).
> > > > > > >
> > > > > > > I can find and share whatever else is needed so that we can figure out why these messages are occurring.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > David
> > > > > > >
> > > > > > > On Fri, May 17, 2013 at 5:16 PM, Vinod Kone <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > You are right in that all of these status updates are what we call "terminal" status updates, and mesos takes specific actions when it gets/generates one of them.
> > > > > > > >
> > > > > > > > TASK_LOST is special in the sense that it is not generated by the executor, but by the slave/master. You could think of it as an exception in mesos. Clearly, these should be rare in a stable mesos system.
> > > > > > > >
> > > > > > > > What do your logs say about the TASK_LOSTs? Is it always the same issue? Are you running w/ cgroups?
> > > > > > > >
> > > > > > > > On Fri, May 17, 2013 at 2:04 PM, David Greenberg <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello! Today I began working on a more advanced version of mesos-submit that will handle hot-spares.
> > > > > > > > >
> > > > > > > > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status updates that meant that I needed to start a new spare process, as the monitored task was killed.
> > > > > > > > > However, I noticed that I often received TASK_LOSTs, and every 5 seconds my scheduler would think its tasks had all died, so it'd restart too many. Nevertheless, the tasks would reappear later on, and I could see them in the web interface of Mesos, continuing to run.
> > > > > > > > >
> > > > > > > > > What is going on?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
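
To make Vinod's point about terminal updates concrete, here is a minimal sketch of the hot-spare bookkeeping David describes, written against the 0.12-era Python bindings. It is only illustrative: HotSpareScheduler and its respawn helper are hypothetical names I've made up, and only the TASK_* constants and the statusUpdate callback shape come from the actual bindings (mesos_pb2 / mesos.Scheduler).

import mesos_pb2

# The four states David lists are exactly the "terminal" updates Vinod
# describes; mesos_pb2 exposes them as module-level enum constants.
TERMINAL_STATES = frozenset([
    mesos_pb2.TASK_FINISHED,
    mesos_pb2.TASK_FAILED,
    mesos_pb2.TASK_KILLED,
    mesos_pb2.TASK_LOST,
])

class HotSpareScheduler(object):
    """Hypothetical sketch; a real framework would subclass mesos.Scheduler."""

    def __init__(self):
        self.live_tasks = set()  # task ids we believe are running

    def statusUpdate(self, driver, update):
        task_id = update.task_id.value
        if update.state not in TERMINAL_STATES:
            return
        # TASK_LOST is raised by the slave/master rather than the executor,
        # so on a flaky network it can arrive for a task that is actually
        # still running. Dropping the id from our books before respawning
        # makes duplicate LOSTs for the same task a no-op, which avoids the
        # "restart too many" stampede described in the thread above.
        if task_id in self.live_tasks:
            self.live_tasks.discard(task_id)
            self.respawn(task_id)

    def respawn(self, task_id):
        # Placeholder: the real framework would remember the task's spec
        # and relaunch it from the next suitable resource offer.
        print("respawning a spare for %s" % task_id)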

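Similarly, on the "Task uses invalid slave" error: the sketch below shows one safe way to build a TaskInfo from an offer with the Python bindings, plus the logging Vinod and Ben suggested. Field names follow my reading of the 0.12-era mesos.proto, the make_task helper and its resource amounts are invented for illustration, and str() on a message is the Python stand-in for protobuf's C++ DebugString().

import mesos_pb2

def make_task(offer, task_id_str, shell_command):
    """Hypothetical helper: build a TaskInfo bound to the offer's slave."""
    task = mesos_pb2.TaskInfo()
    task.name = "task %s" % task_id_str
    task.task_id.value = task_id_str
    # CopyFrom keeps the whole SlaveID message in lockstep with the offer.
    # Assigning offer.slave_id.value by hand also works, but is easier to
    # mix up when several offers are in flight at once, which is exactly
    # the mismatch that produces "Task uses invalid slave: ...".
    task.slave_id.CopyFrom(offer.slave_id)
    task.command.value = shell_command

    cpus = task.resources.add()
    cpus.name = "cpus"
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.5  # illustrative amount

    # str() renders the message in protobuf text format, the Python
    # equivalent of the DebugString() dump Ben asked for.
    print("launching against offer %s on slave %s:\n%s"
          % (offer.id.value, offer.slave_id.value, str(task)))
    assert task.slave_id.value == offer.slave_id.value
    return task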