On Wed, Aug 27, 2014 at 12:29 AM, Jeff Rogers <dv...@diphi.com> wrote:
>
> Do you know if this connection dropping happens mostly when there is a lot of 
> activity or more frequently when there is very low activity?
>
> I recall a few edge cases in the thread pooling where a thread would in some 
> circumstances wait until another connection came in before running, and there 
> might have been a related case where a connection could get dropped.  IIRC, 
> these both happened generally when there was low traffic (or more 
> specifically low concurrent traffic).  Playing with maxconns might diminish 
> the problem in this case.

There doesn't seem to be a correlation between the site load and the
connection drops.  We never really see very low activity - the design
of the site means there are hundreds of thousands of pages, so we're
constantly being crawled by every bot ever spawned (we serve about 8GB
daily to googlebot alone).  But we use fairly small Amazon EC2
instances that are primarily memory limited, so the config is tuned to
reach a memory high water mark of around 700 - 800 MB.  Based on the
load testing I did, that works out to maxthreads of 6 and
maxconnections of 50 (modules/tcl/pools.tcl seems to take the
maxconnections ns_param as the ns_pools -maxconns parameter).  The
project was most recently moved from version 3.4 / Tcl 7.6 to 4.5 /
Tcl 8.6, so there are a lot of legacy bits and shims in place.  I
recall something about the meaning of the maxconnections ns_param
changing between these versions, but it seems to be working as I
would expect.

Server load tends to vary between a loadavg of 0.4 and 2.0, typically
around 0.6 (2 cores).  Concurrency is pretty low: it seldom reaches 6,
and usually sits at around 1 - 3, based on the monitoring we've
currently got in place.  I'm currently trying to reconstruct the exact
thread / concurrency / request context for the connection drop events
by parsing the log files; I'm hoping that might reveal some pattern in
the failures.  But so far they don't seem to correspond with
connection lifecycle events.
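
For anyone wanting to try the same kind of reconstruction, here is a
minimal sketch of the idea in Python.  It assumes the log files have
already been parsed into (start, end) times for each request and a
list of timestamps for the drop events; the function names and the
sample data are made up for illustration and don't reflect the actual
AOLserver log format.

```python
from typing import List, Tuple


def concurrency_at(requests: List[Tuple[float, float]], t: float) -> int:
    """Count the requests in flight at time t (start <= t < end)."""
    return sum(1 for start, end in requests if start <= t < end)


def drop_contexts(
    requests: List[Tuple[float, float]], drops: List[float]
) -> List[Tuple[float, int]]:
    """Pair each drop timestamp with the concurrency at that moment."""
    return [(t, concurrency_at(requests, t)) for t in drops]


# Hypothetical data: three requests and three drop events (seconds).
requests = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0)]
drops = [1.5, 2.75, 5.0]
contexts = drop_contexts(requests, drops)
```

The linear scan per drop is fine for a few hundred events a day; a
sorted sweep would only matter for much larger logs.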

> You also mention favicon.ico; is it mostly or always that?  It's notable for 
> being a small static file, which could point to other causes, like a corrupt 
> interpreter state as Peter suggested.  Or there might be some weirdness with 
> mmap if you have that enabled.

There doesn't seem to be a pattern to the failing requests, sometimes
it's small static files like favicon.ico, but mostly not (although in
our case we're not using fastpath for that - different favicons are
served based on the request context).  At the moment I'm leaning
towards some sort of corrupted connection thread state - the failures
tend to cluster by time, server, user - so that, although the failures
are exceedingly rare overall (220 yesterday), it's often the case that
a given user will have to reload a page several times before they get
a successful response.  The servers are fronted by haproxy which will
tend to send a given session back to the same server.
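
A rough way to quantify that clustering, sketched in Python: group the
failure events by (server, user) and split each group into bursts
wherever the gap between consecutive failures exceeds some threshold.
The event tuple layout, the 60 second default and the sample values
are all assumptions for illustration, not something taken from our
logs.

```python
from collections import defaultdict


def failure_bursts(events, max_gap=60.0):
    """Group failures by (server, user), then split each group into
    bursts: runs of timestamps whose consecutive gaps are <= max_gap.

    events is an iterable of (timestamp, server, user) tuples.
    Returns a dict mapping (server, user) to a list of bursts, where
    each burst is a list of timestamps in ascending order.
    """
    by_key = defaultdict(list)
    for ts, server, user in events:
        by_key[(server, user)].append(ts)

    bursts = {}
    for key, stamps in by_key.items():
        stamps.sort()
        groups = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - groups[-1][-1] <= max_gap:
                groups[-1].append(ts)  # close enough: same burst
            else:
                groups.append([ts])    # gap too large: new burst
        bursts[key] = groups
    return bursts


# Hypothetical failure events: (seconds, server name, user id).
events = [
    (10.0, "web1", "alice"),
    (25.0, "web1", "alice"),
    (40.0, "web1", "alice"),
    (500.0, "web1", "alice"),
    (30.0, "web2", "bob"),
]
bursts = failure_bursts(events)
```

Bursts with several entries for one user on one server would match the
"reload several times" behaviour described above.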

> One other thought, can you switch to naviserver?  The connection handling 
> there has evolved somewhat differently (not to mention more recently) than 
> aolserver, but programming-wise there are not a lot of differences.

It's probably not out of the question if there is a strong argument to
be made that it would fix the problem; we're taking quite a reputation
hit at the moment.  I initially attempted to make the site work on
naviserver since that seemed to be more active, but I ran into
problems with the nsdb / nsdbi change and segfaults when I tried to
get nsdb working on it.  It's also a 15 year old code base that seemed
to be quite sensitive to the small config and api changes in
naviserver, and the port from 3.4 / Tcl 7.6 was tricky enough as it
was (encoding issues, list parsing differences, regexp syntax, etc.)
that the call was made to go with AOLserver 4.5 instead to minimize
the changes required.

The site has been running well on the new version since March /
April, though, so porting to naviserver should be feasible; I'd just
need to make a very strong case.

Cyan

_______________________________________________
aolserver-talk mailing list
aolserver-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/aolserver-talk