Update: There is something more going on than just local port exhaustion. I set:

/proc/sys/net/ipv4/tcp_fin_timeout to 2
/proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (+5K approx)

and I'm still seeing crashes. I'm currently looking for some artificial limit
inside mesos on the maximum number of sockets employed. Is there one?

Much appreciated,
Aaron

Aaron Klingaman
R&D Manager, Sr Architect
Urban Robotics, Inc.
503-539-3693

On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <[email protected]> wrote:

> Has anyone else seen this behavior? I have a python implemented executor
> and framework. Currently using 0.90 from the website. The end application
> submits approximately 45K+ tasks to the framework for scheduling. Due to a
> bug in my tasks, they fail immediately. It is still in the process of
> submitting/failing when mesos-slave crashes and a netstat -tp indicates a
> very large number of sockets in TIME_WAIT (between the single node and the
> master) that belong to mesos-master. The source port is random (44700 in
> the last run). The tasks only last about 1-2 seconds.
>
> I'm assuming mesos-slave is crashing because it can't connect to the
> master any more after source port exhaustion. It seems to me that the
> framework is opening a new connection to mesos-master fairly frequently for
> task status/submission. Maybe slave->master as well.
>
> Fixing my own bug in the task, it works ok because the tasks finish in 1-2
> seconds each, but there are still a fairly high number of TIME_WAIT sockets
> indicating the problem is still there.
>
> The last relevant mesos-slave crash lines:
>
> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed: sockets.count(s) > 0
> *** Check failure stack trace: ***
>     @     0x7fc9e39adebd  google::LogMessage::Fail()
>     @     0x7fc9e39b064f  google::LogMessage::SendToLog()
>     @     0x7fc9e39adabb  google::LogMessage::Flush()
>     @     0x7fc9e39b0edd  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fc9e38e5579  process::SocketManager::next()
>     @     0x7fc9e38e0063  process::send_data()
>     @     0x7fc9e39eb66f  ev_invoke_pending
>     @     0x7fc9e39ef9a4  ev_loop
>     @     0x7fc9e38e0fb7  process::serve()
>     @     0x7fc9e32f1e9a  start_thread
>     @     0x7fc9e2b08cbd  (unknown)
> Aborted
>
> On a side note, I'm anxious to see the changelog for the next release.
>
> Aaron
>
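
For anyone trying to reproduce this, here is a minimal sketch for watching the TIME_WAIT count and the local port range without netstat. It assumes a Linux host, where /proc/net/tcp lists IPv4 sockets with the state as a hex code in the fourth column (06 is TIME_WAIT). The function names are hypothetical helpers, not part of mesos.

```python
from collections import Counter

# Hex state codes used by the Linux kernel in /proc/net/tcp.
# Only a few are named here; anything else is reported by its raw code.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        state_hex = fields[3].upper()
        counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return counts


def local_port_count(port_range_text):
    """Number of ports in an ip_local_port_range value such as '32768 65535'."""
    low, high = (int(p) for p in port_range_text.split())
    return high - low + 1
```

On the slave node, pass in the contents of /proc/net/tcp and /proc/sys/net/ipv4/ip_local_port_range respectively. Comparing the TIME_WAIT count against the size of the port range over the course of a run should show whether the range is actually being used up before the crash.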
