Has anyone else seen this behavior? I have a Python-implemented executor
and framework, currently running 0.9.0 downloaded from the website. The end
application submits 45K+ tasks to the framework for scheduling. Due to a
bug in my tasks, they fail immediately. While tasks are still being
submitted and failing, mesos-slave crashes, and a netstat -tp shows a very
large number of sockets in TIME_WAIT (between the single node and the
master) belonging to mesos-master. The source port is random (44700 in the
last run). The tasks only last about 1-2 seconds.
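
For anyone who wants to reproduce the count, something like this is enough
(a quick sketch; it assumes the usual "netstat -tn" column layout on my box):

import subprocess
from collections import Counter

def time_wait_by_peer():
    # Assumes the usual "netstat -tn" columns:
    # Proto Recv-Q Send-Q Local-Address Foreign-Address State
    out = subprocess.check_output(["netstat", "-tn"]).decode()
    counts = Counter()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1  # keyed on foreign address:port
    return counts

if __name__ == "__main__":
    for peer, n in time_wait_by_peer().most_common(10):
        print("%s %d" % (peer, n))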
I'm assuming mesos-slave is crashing because it can't connect to the master
anymore after source-port exhaustion. It looks like the framework opens a
new connection to mesos-master fairly frequently for task
status/submission, and possibly the slave->master path does the same.
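
Back-of-envelope, the exhaustion theory seems plausible: with the stock
Linux ephemeral port range (~28K ports) and a 60-second TIME_WAIT, one
source can only sustain a few hundred new connections per second to a
single destination, which 45K short-lived tasks would blow through quickly.
Rough sketch of that arithmetic (the 60s TIME_WAIT and the /proc path are
assumed Linux defaults, not something I've verified on this box):

def max_new_connections_per_sec(range_file="/proc/sys/net/ipv4/ip_local_port_range",
                                time_wait_secs=60.0):
    # Each outgoing connection ties up one ephemeral port for the length of
    # TIME_WAIT, so ports / TIME_WAIT is the sustainable rate of new
    # connections to a single destination.
    low, high = [int(x) for x in open(range_file).read().split()]
    ports = high - low + 1
    return ports, ports / time_wait_secs

if __name__ == "__main__":
    ports, rate = max_new_connections_per_sec()
    # Stock range 32768-61000 -> ~28K ports -> only ~470 connections/sec.
    print("%d ephemeral ports, ~%.0f new connections/sec sustainable" % (ports, rate))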
After fixing my own bug in the task, everything works OK because each task
finishes in 1-2 seconds, but there is still a fairly high number of
TIME_WAIT sockets, which suggests the underlying problem is still there.
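
A possible stopgap (not a fix for the connection churn itself) would be
widening the ephemeral port range and letting the kernel reuse TIME_WAIT
sockets for outgoing connections, roughly like this (needs root; the values
are just examples):

# Stopgap only: buys headroom but does nothing about the per-message
# connection churn between the framework, master, and slave.
SETTINGS = {
    "/proc/sys/net/ipv4/ip_local_port_range": "10240 65000",
    "/proc/sys/net/ipv4/tcp_tw_reuse": "1",
}

for path, value in SETTINGS.items():
    with open(path, "w") as f:
        f.write(value + "\n")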
The last relevant mesos-slave crash lines:
F1211 08:41:41.626716 26415 process.cpp:1742] Check failed: sockets.count(s) > 0
*** Check failure stack trace: ***
@ 0x7fc9e39adebd google::LogMessage::Fail()
@ 0x7fc9e39b064f google::LogMessage::SendToLog()
@ 0x7fc9e39adabb google::LogMessage::Flush()
@ 0x7fc9e39b0edd google::LogMessageFatal::~LogMessageFatal()
@ 0x7fc9e38e5579 process::SocketManager::next()
@ 0x7fc9e38e0063 process::send_data()
@ 0x7fc9e39eb66f ev_invoke_pending
@ 0x7fc9e39ef9a4 ev_loop
@ 0x7fc9e38e0fb7 process::serve()
@ 0x7fc9e32f1e9a start_thread
@ 0x7fc9e2b08cbd (unknown)
Aborted
On a side note, I'm anxious to see the changelog for the next release.
Aaron