>> >> I haven't mentioned it yet, but several times I've seen the website
>> >> perform fine all day until I browse to it myself and then all of a
>> >> sudden it's super slow for me and my third-party monitor.  WTF???
>> >
>> > I had a similar problem once when routing through an IPsec VPN
>> > tunnel. I needed to reduce the MTU in front of the tunnel to make
>> > it work correctly. But I think your problem is different.
>>
>>
>> I'm not using IPsec or a VPN.
>>
>>
>> > Does the http server on the other side build up a backlog? Do you
>> > have performance graphs for other parts of the system to see them
>> > in relation? Maybe some router on the path isn't working as expected.
>>
>>
>> I've attached a graph of http response time, CPU usage, and TCP
>> queueing over the past week.  It seems clear from watching top,
>> iotop, and free that my CPU is always the bottleneck on my server.
>
> What kind of application stack is running in the http server? CPU is a
> bottleneck you cannot always circumvent by throwing more CPUs at the
> problem. Maybe that stack needs tuning...
>
> At the point when requests start queuing up in the http server, the
> load on the server rises exponentially. It's like a traffic jam on a
> multi-lane highway. If one car brakes, things may still keep moving.
> If a car in every lane brakes, you suddenly have a huge traffic jam
> backed up for miles, and it takes time to recover from that. You need
> to solve the cause of the "braking" in the first place and add some
> alternative routes for "cars that never brake" (static files and
> cacheable content). Each lane corresponds to one CPU. Adding more
> lanes when you have just 4 CPUs will only make each lane slower. The
> key is to drastically lower the response times, which are much too
> high judging by your graphs. What do memory and IO say?
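
For what it's worth, that "alternative route" for static and cacheable
content maps to something like this in the nginx front end (the paths,
backend port, and cache time here are hypothetical, not my actual
config):

    # serve static assets directly from nginx, bypassing the app stack
    location /static/ {
        root /var/www/mysite;
        expires 7d;
    }

    # everything else still proxies to the apache2 backend
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }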


It turned out this was a combination of two problems which made it
much more difficult to figure out.

First of all, I didn't have enough apache2 processes.  That seems like
it should have been obvious, but it wasn't for two reasons.  Firstly,
my apache2 processes are always idle or nearly idle, even when traffic
levels are high.  But each request that nginx hands off to apache2
must tie up an apache2 process for the full duration of the request,
even though it's my backend application server, not apache2, that is
using all the CPU.  The other thing that made it difficult to track
down was the way munin graphs apache2 processes.  On my graph, busy
and free processes only appeared as tiny dots at the bottom because
apache2's ServerLimit, which is many times greater than the number of
busy and free processes, is drawn on the same graph.  It would be
better to draw MaxClients instead of ServerLimit since I think
MaxClients is more likely to be tuned; it at least appears in the
default config file on Gentoo.  Since busy and free apache2 processes
were virtually invisible on the munin graph, I wasn't able to
correlate their ebb and flow with my server's response times.
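
For reference, the knobs involved are the prefork MPM settings.  The
values below are purely illustrative (not what I actually used) and
would need to be sized to the machine and to how many proxied requests
are in flight at once:

    <IfModule mpm_prefork_module>
        # each proxied request holds one apache2 process for the full
        # backend response time, so MaxClients has to cover the number
        # of concurrent in-flight requests
        StartServers          10
        MinSpareServers       10
        MaxSpareServers       20
        ServerLimit          150
        MaxClients           150
        MaxRequestsPerChild 5000
    </IfModule>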

Once I fixed the apache2 problem, I was sure I had it nailed.  That's
when I emailed here a few days ago to say I thought I had it.  But it
turned out there was another problem: Odoo (formerly known as
OpenERP), which is also running in a reverse proxy configuration
behind nginx.  Whenever someone uses Odoo on my server, it absolutely
destroys performance for my non-Odoo website.  That would have been
really easy to test, and I did try stopping the odoo service early on,
but I ruled it out when the problem persisted after stopping Odoo,
which I now realize must have been because of the apache2 problem.
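
For anyone hitting the same thing, one way to keep Odoo from starving
the rest of the box is to cap its workers and per-request limits in
odoo.conf.  This is only a sketch with made-up values, and it assumes
a multi-worker Odoo setup:

    [options]
    # cap worker processes so Odoo can't claim every core
    workers = 2
    max_cron_threads = 1
    # recycle a worker that burns too much CPU or wall time on a request
    limit_time_cpu = 60
    limit_time_real = 120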

So this was much more difficult to figure out because I had multiple
problems interacting with each other.

- Grant
