Hi Aaron, here's what I know about this particular issue:

Here's the bug: https://issues.apache.org/jira/browse/MESOS-220
Here's the fix (not in 0.9.0): https://reviews.apache.org/r/5995

We're planning on releasing 0.10.0 shortly, where the fix is present.
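In the meantime, if you want to watch the buildup while you reproduce, something like the script below should do it on Linux. To be clear, this is just a quick diagnostic sketch of mine, not anything in the Mesos tree: it tallies socket counts per TCP state by parsing /proc/net/tcp, so you can see TIME_WAIT climb as the status updates churn.

#!/usr/bin/env python
# Illustrative diagnostic only -- not part of Mesos. Counts sockets per
# TCP state by parsing /proc/net/tcp (Linux). TIME_WAIT is state 0x06.
from collections import defaultdict

TCP_STATES = {
    '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV',
    '04': 'FIN_WAIT1', '05': 'FIN_WAIT2', '06': 'TIME_WAIT',
    '07': 'CLOSE', '08': 'CLOSE_WAIT', '09': 'LAST_ACK',
    '0A': 'LISTEN', '0B': 'CLOSING',
}

def count_states(path='/proc/net/tcp'):
    counts = defaultdict(int)
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            state = line.split()[3]  # 4th column is the state, in hex
            counts[TCP_STATES.get(state, state)] += 1
    return counts

if __name__ == '__main__':
    for state, n in sorted(count_states().items()):
        print('%-12s %d' % (state, n))

(netstat -tan | grep -c TIME_WAIT gives you the same number without the per-state breakdown.)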
On Wed, Dec 12, 2012 at 10:47 AM, Aaron Klingaman <[email protected]> wrote:

> It appears the status update messages between the master/slave aren't
> keeping the connections open.
>
> This is the only data transferred on each of the TIME_WAIT connections
> before being closed:
>
> POST /slave/mesos.internal.StatusUpdateAcknowledgementMessage HTTP/1.0
> User-Agent: libprocess/[email protected]:36675
> Connection: Keep-Alive
> Transfer-Encoding: chunked
>
> 8b
>
> %
> #2012121210221931258048-5050-14193-4(
> &2012121210221931258048-5050-14193-0001&
> $4046162e-448a-11e2-9aa3-080027c264fa"'$FH+*s-
> 0
>
> I'll keep digging; any tips are appreciated.
>
> Aaron Klingaman
> R&D Manager, Sr Architect
> Urban Robotics, Inc.
> 503-539-3693
>
>
> On Tue, Dec 11, 2012 at 4:31 PM, Aaron Klingaman <[email protected]> wrote:
>
> > Update:
> >
> > There is something more going on than just local port exhaustion. I set:
> >
> > /proc/sys/net/ipv4/tcp_fin_timeout to 2
> > /proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (+5K approx)
> >
> > and I'm still seeing crashes. I'm currently looking for some artificial
> > limit inside mesos on the maximum number of sockets employed. Is there one?
> >
> > Much appreciated,
> >
> > Aaron
> >
> > Aaron Klingaman
> > R&D Manager, Sr Architect
> > Urban Robotics, Inc.
> > 503-539-3693
> >
> >
> > On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <[email protected]> wrote:
> >
> >> Has anyone else seen this behavior? I have a Python-implemented executor
> >> and framework, currently using 0.9.0 from the website. The end application
> >> submits approximately 45K+ tasks to the framework for scheduling. Due to a
> >> bug in my tasks, they fail immediately. It is still in the process of
> >> submitting/failing when mesos-slave crashes, and a netstat -tp indicates a
> >> very large number of sockets in TIME_WAIT (between the single node and the
> >> master) that belong to mesos-master. The source port is random (44700 in
> >> the last run). The tasks only last about 1-2 seconds.
> >>
> >> I'm assuming mesos-slave is crashing because it can't connect to the
> >> master any more after source port exhaustion. It seems to me that the
> >> framework is opening a new connection to mesos-master fairly frequently
> >> for task status/submission. Maybe slave->master as well.
> >>
> >> After fixing my own bug in the task, everything works OK because the tasks
> >> finish in 1-2 seconds each, but there is still a fairly high number of
> >> TIME_WAIT sockets, indicating the problem is still there.
> >>
> >> The last relevant mesos-slave crash lines:
> >>
> >> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed:
> >> sockets.count(s) > 0
> >> *** Check failure stack trace: ***
> >>     @ 0x7fc9e39adebd google::LogMessage::Fail()
> >>     @ 0x7fc9e39b064f google::LogMessage::SendToLog()
> >>     @ 0x7fc9e39adabb google::LogMessage::Flush()
> >>     @ 0x7fc9e39b0edd google::LogMessageFatal::~LogMessageFatal()
> >>     @ 0x7fc9e38e5579 process::SocketManager::next()
> >>     @ 0x7fc9e38e0063 process::send_data()
> >>     @ 0x7fc9e39eb66f ev_invoke_pending
> >>     @ 0x7fc9e39ef9a4 ev_loop
> >>     @ 0x7fc9e38e0fb7 process::serve()
> >>     @ 0x7fc9e32f1e9a start_thread
> >>     @ 0x7fc9e2b08cbd (unknown)
> >> Aborted
> >>
> >> On a side note, I'm anxious to see the changelog for the next release.
> >>
> >> Aaron
