On Wed, Aug 27, 2014 at 12:29 AM, Jeff Rogers <dv...@diphi.com> wrote:
>
> Do you know if this connection dropping happens mostly when there is a lot of
> activity or more frequently when there is very low activity?
>
> I recall a few edge cases in the thread pooling where a thread would in some
> circumstances wait until another connection came in before running, and there
> might have been a related case where a connection could get dropped. IIRC,
> these both happened generally when there was low traffic (or more
> specifically low concurrent traffic). Playing with maxconns might diminish
> the problem in this case.

There doesn't seem to be a correlation between the site load and the connection drops. We never really see very low activity - the design of the site means there are hundreds of thousands of pages, so we're constantly being crawled by every bot ever spawned (we serve about 8GB daily to googlebot alone). But we use fairly small Amazon EC2 instances that are primarily memory limited, so the config is tuned to reach a memory high water mark of around 700 - 800 MB. Based on the load testing I did, that works out to maxthreads of 6 and maxconnections of 50 (modules/tcl/pools.tcl seems to take the maxconnections ns_param as the ns_pools -maxconns parameter). The project was most recently moved from version 3.4 / Tcl 7.6 to 4.5 / Tcl 8.6, so there are a lot of legacy bits and shims in place. I recall something about the meaning of the maxconnections ns_param changing between these versions, but it seems to be working as I would expect.

Server load tends to vary between a loadavg of 0.4 and 2.0, typically around 0.6 (2 cores). Concurrency is pretty low: it seldom reaches 6 and usually sits at around 1 - 3, based on the monitoring we've currently got in place. I'm currently trying to reconstruct the exact thread / concurrency / request context for the connection drop events by parsing the log files; I'm hoping that might reveal some pattern in the failures. But so far they don't seem to correspond with connection lifecycle events.

> You also mention favicon.ico; is it mostly or always that? It's notable for
> being a small static file, which could point to other causes, like a corrupt
> interpreter state as Peter suggested. Or there might be some weirdness with
> mmap if you have that enabled.

There doesn't seem to be a pattern to the failing requests. Sometimes it's small static files like favicon.ico, but mostly not (although in our case we're not using fastpath for that - different favicons are served based on the request context).

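On reconstructing the concurrency context from the logs: the idea is roughly the sketch below. It assumes the log lines have already been parsed into records with a start time, a duration and a server name. The `Request` type and the `in_flight_at` helper are hypothetical names for illustration only, not anything AOLserver provides, and the real log fields may well differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One parsed log record (hypothetical shape, not an AOLserver format)."""
    start: float     # epoch seconds when the request began
    duration: float  # seconds spent serving it
    server: str      # which backend handled it


def in_flight_at(requests, t, server):
    """Count requests on `server` that were being served at time `t`.

    A request counts as in flight when start <= t < start + duration.
    """
    return sum(
        1
        for r in requests
        if r.server == server and r.start <= t < r.start + r.duration
    )


# Made-up records purely to show the call; not real log data.
sample = [
    Request(start=0.0, duration=2.0, server="a"),
    Request(start=1.0, duration=2.0, server="a"),
    Request(start=1.5, duration=1.0, server="b"),
]
concurrency_at_drop = in_flight_at(sample, 1.5, "a")
```

Running this for the timestamp of each dropped connection should show whether the drops line up with a particular concurrency level on a given server.
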
At the moment I'm leaning towards some sort of corrupted connection thread state. The failures tend to cluster by time, server and user, so although they are exceedingly rare overall (220 yesterday), it's often the case that a given user will have to reload a page several times before they get a successful response. The servers are fronted by haproxy, which will tend to send a given session back to the same server.

> One other thought, can you switch to naviserver? The connection handling
> there has evolved somewhat differently (not to mention more recently) than
> aolserver, but programming-wise there are not a lot of differences.

It's probably not out of the question if there is a strong argument to be made that it would fix the problem; we're taking quite a reputation hit at the moment. I initially attempted to make the site work on naviserver since that seemed to be more active, but I ran into problems with the nsdb / nsdbi change and segfaults when I tried to get nsdb working on it. It's also a 15 year old code base that seemed to be quite sensitive to the small config and API changes in naviserver, and the port from 3.4 / Tcl 7.6 was tricky enough as it was (encoding issues, list parsing differences, regexp syntax, etc.), so the call was made to go with AOLserver 4.5 instead to minimize the changes required. That said, the site has been running well on the new version since March / April, so porting to naviserver should be feasible; I'd just need to make a very strong case.

Cyan

_______________________________________________
aolserver-talk mailing list
aolserver-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/aolserver-talk