Update:

There is something more going on than just local port exhaustion. I set:

/proc/sys/net/ipv4/tcp_fin_timeout to 2
/proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (roughly 5K more ports than the default range)
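
(For reference, set on the fly with, e.g.:

    echo 2 > /proc/sys/net/ipv4/tcp_fin_timeout
    echo "32768 65535" > /proc/sys/net/ipv4/ip_local_port_range

or equivalently via sysctl -w.)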

and I'm still seeing crashes. I'm currently looking for some artificial
limit inside mesos on the maximum number of sockets employed. Is there one?
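
(One non-mesos ceiling worth ruling out: the slave process's open file
descriptor limit, which would cap sockets before anything mesos-specific.
It can be checked with, e.g.:

    grep 'open files' /proc/<mesos-slave-pid>/limits

where <mesos-slave-pid> is the running slave's PID.)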

Much appreciated,

Aaron


Aaron Klingaman
R&D Manager, Sr Architect
Urban Robotics, Inc.
503-539-3693




On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <[email protected]> wrote:

> Has anyone else seen this behavior? I have a Python-implemented executor
> and framework, currently using 0.90 from the website. The end application
> submits approximately 45K tasks to the framework for scheduling. Due to a
> bug in my tasks, they fail immediately. Tasks are still being
> submitted/failed when mesos-slave crashes, and netstat -tp indicates a
> very large number of sockets in TIME_WAIT (between the single node and
> the master) that belong to mesos-master. The source port is random
> (44700 in the last run). The tasks only last about 1-2 seconds.
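>
> (For a rough count, something like netstat -tn | grep -c TIME_WAIT on
> the node shows how many are piling up at any moment.)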
>
> I'm assuming mesos-slave is crashing because it can't connect to the
> master any more after source port exhaustion. It seems to me that the
> framework is opening a new connection to mesos-master fairly frequently for
> task status/submission. Maybe slave->master as well.
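>
> Back of the envelope: the default ephemeral range (32768-61000) is only
> about 28K ports, and Linux holds each closed connection in TIME_WAIT for
> 60 seconds, so anything over roughly 28000/60 ~= 470 connections per
> second to the same destination will exhaust the range. 45K short-lived,
> immediately-failing tasks could plausibly sustain that rate.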
>
> After fixing my own bug in the tasks, everything works OK because each
> task finishes in 1-2 seconds, but there is still a fairly high number of
> TIME_WAIT sockets, indicating the underlying problem is still there.
>
> The last relevant mesos-slave crash lines:
>
> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed:
> sockets.count(s) > 0
> *** Check failure stack trace: ***
>     @     0x7fc9e39adebd  google::LogMessage::Fail()
>     @     0x7fc9e39b064f  google::LogMessage::SendToLog()
>     @     0x7fc9e39adabb  google::LogMessage::Flush()
>     @     0x7fc9e39b0edd  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fc9e38e5579  process::SocketManager::next()
>     @     0x7fc9e38e0063  process::send_data()
>     @     0x7fc9e39eb66f  ev_invoke_pending
>     @     0x7fc9e39ef9a4  ev_loop
>     @     0x7fc9e38e0fb7  process::serve()
>     @     0x7fc9e32f1e9a  start_thread
>     @     0x7fc9e2b08cbd  (unknown)
> Aborted
>
> On a side note, I'm anxious to see the changelog for the next release.
>
> Aaron
>
>
