Hi Aaron, here's what I know about this particular issue:

Here's the bug: https://issues.apache.org/jira/browse/MESOS-220
Here's the fix (not in 0.9.0): https://reviews.apache.org/r/5995

We're planning to release 0.10.0 shortly, which includes the fix.

On Wed, Dec 12, 2012 at 10:47 AM, Aaron Klingaman <[email protected]> wrote:

> It appears the status update messages between the master/slave aren't
> keeping the connections open.
>
> This is the only data transferred on each of the TIME_WAIT connections
> before being closed:
>
> POST /slave/mesos.internal.StatusUpdateAcknowledgementMessage HTTP/1.0
> User-Agent: libprocess/[email protected]:36675
> Connection: Keep-Alive
> Transfer-Encoding: chunked
>
> 8b
>
> %
> #2012121210221931258048-5050-14193-4(
> &2012121210221931258048-5050-14193-0001&
> $4046162e-448a-11e2-9aa3-080027c264fa"'$FH+*s-
> 0
>
> I'll keep digging; any tips are appreciated.
>
> Aaron Klingaman
> R&D Manager, Sr Architect
> Urban Robotics, Inc.
> 503-539-3693
>
>
>
>
> On Tue, Dec 11, 2012 at 4:31 PM, Aaron Klingaman <[email protected]> wrote:
>
> > Update:
> >
> > There is something more going on than just local port exhaustion. I set:
> >
> > /proc/sys/net/ipv4/tcp_fin_timeout to 2
> > /proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (roughly 5K more ports)
> >
> > and I'm still seeing crashes. I'm currently looking for some artificial
> > limit inside mesos on the maximum number of sockets employed. Is there one?
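> >
> > (For reference, applying and reading back those values can be done with a
> > rough Python sketch like the one below; the /proc paths are the standard
> > Linux ones, and nothing Mesos-specific is assumed:)
> >
> > # rough sketch: apply and read back the sysctl values above (Linux /proc)
> > settings = {
> >     "/proc/sys/net/ipv4/tcp_fin_timeout": "2",
> >     "/proc/sys/net/ipv4/ip_local_port_range": "32768 65535",
> > }
> > for path, value in settings.items():
> >     with open(path, "w") as f:   # writing requires root
> >         f.write(value + "\n")
> >     with open(path) as f:        # read back to confirm the kernel took it
> >         print("%s = %s" % (path, f.read().strip()))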
> >
> > Much appreciated,
> >
> > Aaron
> >
> >
> > Aaron Klingaman
> > R&D Manager, Sr Architect
> > Urban Robotics, Inc.
> > 503-539-3693
> >
> >
> >
> >
> > On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <[email protected]> wrote:
> >
> >> Has anyone else seen this behavior? I have a Python-implemented executor
> >> and framework. Currently using 0.9.0 from the website. The end application
> >> submits approximately 45K+ tasks to the framework for scheduling. Due to a
> >> bug in my tasks, they fail immediately. It is still in the process of
> >> submitting/failing when mesos-slave crashes, and a netstat -tp indicates a
> >> very large number of sockets in TIME_WAIT (between the single node and the
> >> master) that belong to mesos-master. The source port is random (44700 in
> >> the last run). The tasks only last about 1-2 seconds.
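> >>
> >> (For what it's worth, the TIME_WAIT buildup can also be counted without
> >> netstat; here's a rough Python sketch that parses /proc/net/tcp, where
> >> state code 06 means TIME_WAIT, a Linux detail rather than anything
> >> Mesos-specific:)
> >>
> >> # rough sketch: count TIME_WAIT sockets by reading /proc/net/tcp (Linux)
> >> TIME_WAIT = "06"  # kernel state code for TIME_WAIT in /proc/net/tcp
> >> with open("/proc/net/tcp") as f:
> >>     f.readline()  # skip the header line
> >>     count = sum(1 for line in f if line.split()[3] == TIME_WAIT)
> >> print("sockets in TIME_WAIT: %d" % count)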
> >>
> >> I'm assuming mesos-slave is crashing because it can't connect to the
> >> master any more after source port exhaustion. It seems to me that the
> >> framework is opening a new connection to mesos-master fairly frequently
> >> for task status/submission. Maybe slave->master as well.
> >>
> >> After fixing my own bug in the task, things work OK because the tasks
> >> finish in 1-2 seconds each, but there is still a fairly high number of
> >> TIME_WAIT sockets, indicating the underlying problem is still there.
> >>
> >> The last relevant mesos-slave crash lines:
> >>
> >> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed:
> >> sockets.count(s) > 0
> >> *** Check failure stack trace: ***
> >>     @     0x7fc9e39adebd  google::LogMessage::Fail()
> >>     @     0x7fc9e39b064f  google::LogMessage::SendToLog()
> >>     @     0x7fc9e39adabb  google::LogMessage::Flush()
> >>     @     0x7fc9e39b0edd  google::LogMessageFatal::~LogMessageFatal()
> >>     @     0x7fc9e38e5579  process::SocketManager::next()
> >>     @     0x7fc9e38e0063  process::send_data()
> >>     @     0x7fc9e39eb66f  ev_invoke_pending
> >>     @     0x7fc9e39ef9a4  ev_loop
> >>     @     0x7fc9e38e0fb7  process::serve()
> >>     @     0x7fc9e32f1e9a  start_thread
> >>     @     0x7fc9e2b08cbd  (unknown)
> >> Aborted
> >>
> >> On a side note, I'm anxious to see the changelog for the next release.
> >>
> >> Aaron
> >>
> >>
> >
>
