Hi Paolo, Nico. Another piece of the puzzle?
On Thu, 16 Jul 2009 15:33:16 +0200 Stefan Schmiedl <[email protected]> wrote:

> The students of class 6a logged into their Windows domain accounts,
> started Firefox and entered the URL for the test (stage 1 above).
> Then they entered their names into the registration page (stage 2)
> and clicked on the button to access the test. Shortly after server
> CPU load went to 100% with the following error message being repeated
> as fast as the remote terminal could cope with:
>
> "Socket accept error: Error while trying to accept a socket connection"
>
> Client side a one-liner 500 error message was reported.
>
> Time for pkill gst-remote ... I rebuilt the image and started the
> server again. This time we staged the 25 "almost simultaneous" login
> attempts into four batches of 6 each and things worked fine from that point
> on.
>
> After finishing the test, the students logged off and the next class, 6b ...
> had the exact same experience ... and 6c and 6d, too.
>
> For the final group I tried a different approach:
> They logged on, opened the URL, and sat on their hands.
> I killed gst-remote, rebuilt the image, restarted gst-remote and told them
> to reload the page. They then entered their names and started clicking on
> the answers and the Socket error of Doom appeared again. Kill, rebuild,
> restart. Everybody loads the registration page (not staged, just 25 students
> clicking when they're ready), enters their name and works on the test as it
> should be. No hiccup.

While I have not yet managed to reproduce the error message through a
ruby mechanize script, I have noticed something suspicious:

Start the server, check sockets on the server:

  server # netstat -n | grep 4080
  server #

Run a mechanize script performing a few requests on the client. The
script fetches the first page and the referenced css and js files
(a sketch of the script is included further below):

  client $ ruby mech.rb 1
  client $

Look at sockets on the client:

  client $ netstat -n | grep 4080
  tcp  0  0  192.168.1.5:37021  88.198.5.34:4080     FIN_WAIT2

Look at sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT

Wait about 10 min ... (typing this text)

Look at sockets on the client:

  client $ netstat -n | grep 4080
  client $

Look at sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT

Run the mechanize script again:

  client $ ruby mech.rb 1
  client $

Sockets on the client:

  client $ netstat -n | grep 4080
  tcp  0  0  192.168.1.5:57747  88.198.5.34:4080     FIN_WAIT2

Sockets on the server:

  server # netstat -n | grep 4080
  tcp  0  0  88.198.5.34:4080   93.223.36.238:37021  CLOSE_WAIT
  tcp  0  0  88.198.5.34:4080   93.223.36.238:57747  CLOSE_WAIT

Soooo ... the problem described above has nothing to do with timing
issues, but with resource exhaustion due to _too many_ open sockets
stuck in CLOSE_WAIT state.
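Just to illustrate the mechanism (this is only a toy Ruby server, not
swazoo code, and port 4081 is an arbitrary choice): a socket sits in
CLOSE_WAIT from the moment the remote end closes its half of the
connection until the local side calls close on it, so a server that
accepts connections and never closes them collects exactly the kind of
entries shown above:

  # close_wait_demo.rb -- toy server only, port 4081 chosen arbitrarily
  require 'socket'

  server = TCPServer.new(4081)
  leaked = []                 # hold references so the GC cannot close them
  loop do
    sock = server.accept
    sock.read                 # returns once the client closes its end (EOF)
    leaked << sock            # never sock.close => stays in CLOSE_WAIT
  end

Connect to it a few times with a client that closes its end, e.g.

  ruby -rsocket -e 'TCPSocket.new("localhost", 4081).close'

and netstat on the server side shows one CLOSE_WAIT entry per
connection until the process exits. Presumably whatever layer owns the
accepted socket has to issue that close once the response has been
written.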
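For reference, the mech.rb runs above boil down to something like the
following. This is only a minimal sketch: the base URL, the handling of
the numeric argument and the css/js selectors are stand-ins, not the
exact script:

  # mech_sketch.rb -- fetch the first page plus its css/js references
  require 'rubygems'
  require 'mechanize'

  base  = 'http://88.198.5.34:4080/'   # stand-in for the real URL
  runs  = (ARGV[0] || 1).to_i
  agent = Mechanize.new                # older releases: WWW::Mechanize.new

  runs.times do
    page = agent.get(base)
    # pull in the referenced stylesheets and scripts, as a browser would
    page.search('link[rel=stylesheet]').each { |l| agent.get(l['href']) }
    page.search('script[src]').each          { |s| agent.get(s['src'])  }
  end

Each run leaves the client's sockets in FIN_WAIT2 (they go away after a
while) and the server's in CLOSE_WAIT (they don't), exactly as in the
netstat listings above.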
Note also that the problem is heavily exacerbated when the app is
accessed through an apache proxy, as was done in the test session.
With the proxy in place, running the same requests as above

  client $ ruby mech.rb 1

results in the following server-side mess:

  server # netstat -n | grep 4080
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57163  CLOSE_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57157  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57155  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57156  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57163  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57157  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57161  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57153  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57161  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57155  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57162  127.0.0.1:4080   TIME_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57159  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57154  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57158  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57156  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57160  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:57158  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57160  CLOSE_WAIT
  tcp  0  0  127.0.0.1:57159  127.0.0.1:4080   FIN_WAIT2
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57153  CLOSE_WAIT
  tcp  0  0  127.0.0.1:4080   127.0.0.1:57154  CLOSE_WAIT

The FIN_WAIT2 sockets will disappear after a while; the *10* CLOSE_WAIT
sockets won't. And since the remote end has already closed them, they
won't be reused either, AFAICT.

Now look at what google found for me:

  http://www.sunmanagers.org/pipermail/summaries/2006-January/007068.html

I think one of swazoo/sport/socket needs a behavioral readjustment.

s.

_______________________________________________
help-smalltalk mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
