Hi Paolo, Nico. Another piece of the puzzle?
On Thu, 16 Jul 2009 15:33:16 +0200 Stefan Schmiedl <[email protected]> wrote:

> The students of class 6a logged into their Windows domain accounts,
> started Firefox and entered the URL for the test (stage 1 above).
> Then they entered their names into the registration page (stage 2)
> and clicked on the button to access the test. Shortly after server
> CPU load went to 100% with the following error message being repeated
> as fast as the remote terminal could cope with:
>
> "Socket accept error: Error while trying to accept a socket connection"
>
> Client side a one-liner 500 error message was reported.
>
> Time for pkill gst-remote ... I rebuilt the image and started the
> server again. This time we staged the 25 "almost simultaneous" login
> attempts into four batches of 6 each and things worked fine from that point
> on.
>
> After finishing the test, the students logged off and the next class, 6b ...
> had the exact same experience ... and 6c and 6d, too.
>
> For the final group I tried a different approach:
> They logged on, opened the URL, and sat on their hands.
> I killed gst-remote, rebuilt the image, restarted gst-remote and told them
> to reload the page. They then entered their names and started clicking on
> the answers and the Socket error of Doom appeared again. Kill, rebuild,
> restart. Everybody loads the registration page (not staged, just 25 students
> clicking when they're ready), enters their name and works on the test as it
> should be. No hiccup.

While I have not yet managed to reproduce the error message through a
ruby mechanize script, I have noticed something suspicious:

Start the server, check sockets on the server:

  server # netstat -n | grep 4080
  server #

Run a mechanize script performing a few requests on the client. The
script fetches the first page and the referenced css and js files
(a sketch of the script is included further below):

  client $ ruby mech.rb 1
  client $

Look at sockets on the client:

  client $ netstat -n | grep 4080
  tcp  0  0  192.168.1.5:37021  88.198.5.34:4080     FIN_WAIT2

Look at sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT

Wait about 10 min ... (typing this text)

Look at sockets on the client:

  client $ netstat -n | grep 4080
  client $

Look at sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT

Run the mechanize script again:

  client $ ruby mech.rb 1
  client $

Sockets on the client:

  client $ netstat -n | grep 4080
  tcp  0  0  192.168.1.5:57747  88.198.5.34:4080     FIN_WAIT2

Sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT
  tcp  0  0  88.198.5.34:4080   93.223.36.238:57747  CLOSE_WAIT

Soooo ... the problem described above has nothing to do with timing
issues, but with resource exhaustion due to _too many_ open sockets
stuck in CLOSE_WAIT state.
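Just to illustrate the mechanism (this is only a toy Ruby server, not
swazoo code, and port 4081 is an arbitrary choice): a socket sits in
CLOSE_WAIT from the moment the remote end closes its half of the
connection until the local side calls close on it, so a server that
accepts connections and never closes them collects exactly the kind of
entries shown above:

  # close_wait_demo.rb -- toy server only, port 4081 chosen arbitrarily
  require 'socket'

  server = TCPServer.new(4081)
  leaked = []                 # hold references so the GC cannot close them
  loop do
    sock = server.accept
    sock.read                 # returns once the client closes its end (EOF)
    leaked << sock            # never sock.close => stays in CLOSE_WAIT
  end

Connect to it a few times with a client that closes its end, e.g.

  ruby -rsocket -e 'TCPSocket.new("localhost", 4081).close'

and netstat on the server side shows one CLOSE_WAIT entry per
connection until the process exits. Presumably whatever layer owns the
accepted socket has to issue that close once the response has been
written.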
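For reference, the mech.rb runs above boil down to something like the
following. This is only a minimal sketch: the base URL, the handling of
the numeric argument and the css/js selectors are stand-ins, not the
exact script:

  # mech_sketch.rb -- fetch the first page plus its css/js references
  require 'rubygems'
  require 'mechanize'

  base  = 'http://88.198.5.34:4080/'   # stand-in for the real URL
  runs  = (ARGV[0] || 1).to_i
  agent = Mechanize.new                # older releases: WWW::Mechanize.new

  runs.times do
    page = agent.get(base)
    # pull in the referenced stylesheets and scripts, as a browser would
    page.search('link[rel=stylesheet]').each { |l| agent.get(l['href']) }
    page.search('script[src]').each          { |s| agent.get(s['src'])  }
  end

Each run leaves the client's sockets in FIN_WAIT2 (they go away after a
while) and the server's in CLOSE_WAIT (they don't), exactly as in the
netstat listings above.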
Note also that the problem is heavily exacerbated when the app is
accessed through an apache proxy, as was done in the test session.
With the proxy in place, running the same requests as above

  client $ ruby mech.rb 1

results in the following server-side mess:

  server # netstat -n | grep 4080
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57163  CLOSE_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57157  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57155  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57156  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57163  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57157  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57161  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57153  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57161  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57155  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57162  127.0.0.1:4080   TIME_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57159  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57154  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57158  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57156  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57160  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57158  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57160  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57159  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57153  CLOSE_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57154  CLOSE_WAIT

The FIN_WAIT2 sockets will disappear after a while; the *10* CLOSE_WAIT
sockets won't. And since the remote end has already closed them, they
won't be reused either, AFAICT.

Now look at what google found for me:

  http://www.sunmanagers.org/pipermail/summaries/2006-January/007068.html

I think one of swazoo/sport/socket needs a behavioral readjustment.

s.

_______________________________________________
help-smalltalk mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
