I hope the following is not too long and confusing ...

On 03.05.2011 22:02, Mark Thomas wrote:
Scenario
--------
This ended up being very long, so I moved it to the end. The exact
pattern of delays will vary depending on timeouts, request frequency
etc. but the scenario shows an example of how delays can occur. The
short version is that requests with data to process (particularly new
connections) tend to get delayed in the queue waiting for a thread to
process them when the threads are all tied up processing keep-alive
connections.

Root cause
----------
The underlying cause of all of the performance issues observed is that
the threads are tied up doing HTTP keep-alive when there is no data to
process, while there are other connections in the queue that do have
data that could be processed.

Solution A
----------
NIO is designed to handle this using a poller. That isn't available to
BIO so I attempted to simulate it. That generated excessive CPU load so
I do not think simulated polling is the right solution.
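
I assume the simulated poll in each worker thread looked roughly like this (my own sketch with made-up names, not the actual patch): set a short SO_TIMEOUT on the blocking socket, try to read, and treat a SocketTimeoutException as "no data yet".

import java.io.IOException;
import java.io.PushbackInputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch only: one worker thread "polling" a blocking socket.
class SimulatedPoll {

    static final int POLL_TIMEOUT_MS = 100; // the "poll" timeout discussed below

    // Returns true once data is available, false if the client closed the connection.
    static boolean waitForData(Socket socket) throws IOException {
        socket.setSoTimeout(POLL_TIMEOUT_MS);
        PushbackInputStream in = new PushbackInputStream(socket.getInputStream());
        while (true) {
            try {
                int b = in.read();   // blocks for at most POLL_TIMEOUT_MS
                if (b == -1) {
                    return false;    // connection closed by the client
                }
                in.unread(b);        // give the byte back to the real request parser
                return true;
            } catch (SocketTimeoutException e) {
                // No data within the poll interval. A simulated poller would now
                // hand the connection back to a queue and pick up another one;
                // with mostly idle keep-alive connections this branch fires once
                // per thread every POLL_TIMEOUT_MS.
            }
        }
    }
}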

I expect generating the SocketTimeoutException is expensive, because the JVM has to fill in the stack trace. The rate of exceptions when handling mostly keep-alive connections (the extreme case) is the number of threads divided by your "poll" timeout, e.g. 200 threads with a 100ms timeout means 2000 exceptions per second. Even if there is another reason for the high CPU load, I expect it to be roughly proportional to the poll rate. In a saturated system with lots of keep-alive connections you will have:

pollRate = maxThreads / pollTimeout
(e.g. 200 / 0.1s = 2000/s)
averageWaitBeforePoll = maxConnections / pollRate / 2
(e.g. 10000 / (2000/s) / 2 = 2.5s)
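
Spelled out as a tiny calculation (nothing Tomcat-specific; the class and method names are mine, purely for illustration):

// Arithmetic sketch only: relates maxThreads, pollTimeout and maxConnections.
public class PollMath {

    static void print(int maxThreads, double pollTimeoutSeconds, int maxConnections) {
        // One poll attempt per thread per pollTimeout.
        double pollRate = maxThreads / pollTimeoutSeconds;
        // On average a connection waits for half a sweep over all connections.
        double averageWaitBeforePoll = maxConnections / pollRate / 2;
        System.out.printf("pollRate=%.0f/s averageWaitBeforePoll=%.2fs%n",
                pollRate, averageWaitBeforePoll);
    }

    public static void main(String[] args) {
        // The sizing above: 200 threads, 100ms poll timeout, 10000 connections.
        print(200, 0.1, 10000);   // pollRate=2000/s averageWaitBeforePoll=2.50s
    }
}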

So we see that in your case, although we already have a high poll event rate, every connection only gets polled every 2.5 seconds, which is far too much request latency. If we want to reduce this latency, we would need to increase the rate, but then the CPU load gets even worse. Or we need to reduce maxConnections.

Let us try a different sizing:

maxThreads 200, maxConnections 1000 (less overcommitment, but still well above the 200 threads), pollTimeout 200ms.

pollRate = 1000/s, half of the previous rate because of the doubled timeout.
averageWaitBeforePoll = 0.5s.

Although this is an improvement, we still have a high poll rate, and even a 0.5 second average wait for new connections isn't nice.
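
The same little arithmetic helper from above, fed with this second sizing (again just illustration):

// 200 threads, 200ms poll timeout, 1000 connections.
PollMath.print(200, 0.2, 1000);   // pollRate=1000/s averageWaitBeforePoll=0.50s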

The tradeoff is: to be CPU efficient we have to reduce the poll rate, but with a fixed thread and connection count that automatically means a longer averageWaitBeforePoll (= maxConnections * pollTimeout / (2 * maxThreads)), i.e. more request latency. There seems to be no sweet spot when sizing the system.

If we do not find an efficient way (in terms of CPU and thread blocking time) to handle the keep-alive connections, then I don't expect a solution to the problem - except for disabling keep-alive or not accepting many more connections than we have threads. In the end that's the "75% of threads busy, then disable keep-alive" solution. One could throw in some "reduce the keep-alive timeout under load" feature, but I doubt it would help much more than the simple solution.
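
Just to spell out how simple that heuristic is (a sketch with made-up names, not the actual Tomcat 6 code):

// Sketch of the "disable keep-alive under load" rule; names are illustrative.
class KeepAlivePolicy {

    private final int maxThreads;

    KeepAlivePolicy(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // Offer keep-alive only while fewer than 75% of the worker threads are busy;
    // above that threshold every response is sent with "Connection: close" so
    // threads become free for connections waiting in the queue.
    boolean allowKeepAlive(int busyThreads) {
        return busyThreads < maxThreads * 0.75;
    }
}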

Do we see a way of handling many keep-alive connections that is efficient in terms of CPU time and thread blocking time? I don't see any API that would help here. Of course one could try to build a hybrid "blocking for normal processing but non-blocking for keep-alive" thing, but since we already have NIO I would also support recommending NIO for keep-alive.

Switching the default from BIO to NIO is a big change, but only after we switch will we find the last buglets and problems arising under rare conditions. So if we want to switch, we should do it very soon. Doing it late in the TC 7 cycle would be bad.

Lastly: APR is used for server-to-server connections, as is HTTP when a reverse proxy runs in front of Tomcat. In those cases we have far fewer connections with a higher rate of requests per connection. There maxThreads == maxConnections is fine (and even the 75% rule could be switched off). So for this scenario it would be nice not to drop BIO, at least until the major TC version after the default has switched to NIO.

Solution B
----------
Return to the Tomcat 6 implementation where maxConnections == maxThreads.

Additional clean-up
-------------------
maxConnections is unnecessary in APR since pollerSize performs the same
function.

Summary
-------
The proposed changes are:
a) restore disabling keep-alive when threads used >= 75% of maxThreads
b) remove maxConnections and associated code from the APR connector
c) remove the configuration options for maxConnections from the BIO
connector
d) use maxThreads instead of maxConnections for the BIO connector
e) update the docs

I agree (especially after your additional clarifications in reply to Konstantin).

Rainer
