Hi Jesse,

On 24.02.2015 19:27, Jesse Defer wrote:
-----Original Message-----
From: Rainer Jung [mailto:rainer.j...@kippdata.de]
Sent: Monday, February 23, 2015 5:00 PM
To: Tomcat Users List
Subject: Re: mod_jk causing stuck Apache processes

On 24.02.2015 00:29, Jesse Defer wrote:

I suggest:

- check whether there's a performance problem on the webapp side when it
happens next time
    - check thread dumps
    - check GC log
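      For example, something along these lines (the PID, file paths and JVM
      flags are only placeholders; the GC options assume a Java 7/8 style JVM):

          # take a thread dump of the Tomcat JVM, a few times while the problem is visible
          jstack -l <tomcat-pid> > /tmp/threaddump.$(date +%s).txt

          # GC logging, e.g. added to CATALINA_OPTS and effective after a restart
          -Xloggc:/path/to/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps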

No long pauses that I can see.

    - activate the Tomcat access log and add "%D" to the pattern, so that
you can analyze request durations on the Tomcat side post-mortem
    - using the request duration logs you can also check whether a
potential performance problem happens on all nodes or only on some.
Depending on the result, checking the GC logs and creating and inspecting
thread dumps will be more or less work.
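      A minimal valve along these lines in server.xml would do (directory
      and prefix are only examples; %D logs the processing time in
      milliseconds):

          <Valve className="org.apache.catalina.valves.AccessLogValve"
                 directory="logs" prefix="access." suffix=".log"
                 pattern='%h %l %u %t "%r" %s %b %D' />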

- run strace with time stamps plus system call duration next time the
problem occurs
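  Something like this should do (output file and PID are placeholders):

      # -tt = wall clock timestamps with microseconds, -T = time spent in each
      # syscall, -f = follow forked children
      strace -f -tt -T -p <pid-of-stuck-httpd-child> -o /tmp/httpd.strace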

http://pastebin.com/raw.php?i=X09ZTW4V

- also add %D to the LogFormat of your Apache access log, and add the name
of the load balancer member that handled the request as well. mod_jk
provides "notes" that can be logged in the access log (look for
"JK_LB_LAST_NAME" in
http://tomcat.apache.org/connectors-doc/reference/apache.html)
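  For example (the format name and log path are only examples; %D is the
  duration in microseconds, %{JK_LB_LAST_NAME}n logs the mod_jk note with
  the member that handled the request):

      LogFormat "%h %l %u %t \"%r\" %>s %b %D %{JK_LB_LAST_NAME}n" jk_timing
      CustomLog logs/access_log jk_timing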

- prepare to switch to a threaded MPM, either worker or event. Use
JkWatchdogInterval there.
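  A rough sketch (the sizing numbers are made up and need tuning;
  MaxRequestWorkers is the httpd 2.4 name, on 2.2 it is MaxClients;
  JkWatchdogInterval moves the shared memory maintenance out of the request
  path into a separate watchdog thread):

      <IfModule mpm_worker_module>
          ServerLimit         16
          ThreadsPerChild     64
          MaxRequestWorkers 1024
      </IfModule>
      JkWatchdogInterval 60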

- put the JkShmFile on a local file system
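  For example (the path is only an illustration, any local disk will do):

      JkShmFile /var/run/mod_jk/jk-runtime-status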

- check whether your system clock on the Apache servers works correctly
and doesn't jump (common virtualization problem)

NTP is configured on all hosts and we monitor drift with collectd every 10
seconds; no anomalies.

- configure the jk status worker and check whether many of your workers
are in ERROR state. You can also retrieve info from the JK status worker
in a simple-to-parse text format and automate retrieval and extraction of
data (similar to what one can do with the Apache server-status)
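  For instance (this assumes a status worker mounted at /jkstatus; the mime
  parameter, one of xml, txt or prop if I remember correctly, selects a
  machine-readable format):

      curl -s 'http://localhost/jkstatus?mime=prop'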

I haven't seen the status worker show any workers in error.

Here are some logs with request durations.  The first entry in each pair is
httpd, the second is Tomcat.  It is interesting that the Tomcat response is
very fast, but in two of the cases the Apache response is very slow.  I wasn't
able to correlate the JVM stack traces for these, maybe because they executed
quickly and weren't blocked.

10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" "Mozilla/5.0" 
pid=25649 tid=47017797431248 time_s=1 time_us=1174645 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 content_type="text/html" 
last_modified="-"
cache_control="no-cache" worker=mylb first_name=my27 first_busy=0 
last_name=my27 last_busy=0 ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:34:37 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" 
"Mozilla/5.0" time_ms=115 tomcat_thread=ajp-62014-31 x_id=VOy2LYHbdfEAAGQxXQkAAAC0 app=my

10.142.64.251 - - [24/Feb/2015:10:31:02 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25697 
tid=47017797431248 time_s=246 time_us=246422721 x_id=VOy1VoHbdfEAAGRhuqQAAADq content_type="text/html" 
last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 
ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:35:09 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" 
"Mozilla/5.0" time_ms=53 tomcat_thread=ajp-62014-32 x_id=VOy1VoHbdfEAAGRhuqQAAADq app=my

10.142.64.251 - - [24/Feb/2015:10:30:41 -0700] "GET /my/ HTTP/1.1" 200 8184 "-" "Mozilla/5.0" pid=25772 
tid=47017797431248 time_s=371 time_us=371698757 x_id=VOy1QYHbdfEAAGSsv-oAAAEy content_type="text/html" 
last_modified="-" cache_control="-" worker=mylb first_name=my27 first_busy=0 last_name=my27 last_busy=0 
ssl_protocol=TLSv1 ssl_cipher=AES256-SHA
10.142.64.251 - - [24/Feb/2015:10:36:52 -0700] "GET /my/ HTTP/1.1" 200 72209 "-" 
"Mozilla/5.0" time_ms=57 tomcat_thread=ajp-62014-5 x_id=VOy1QYHbdfEAAGSsv-oAAAEy app=my

I let one server stay busy for about an hour to collect some logs.
There are some slow requests for this app in the Tomcat logs, but few over
60 seconds and only one over 120 seconds (across all 17 of this app's Tomcats).
Meanwhile there are thousands of httpd requests over 120s from just the
one web server with the stuck processes.  Plotting the request time for
the entire httpd farm shows a large jump for the affected server but
no change in the other servers.

So here is a patch for version 1.2.40:

http://people.apache.org/~rjung/patches/jk_shm_lock.patch

The patch is non-trivial though. It reduces the load on the lock by

- moving some trivial per-member work, which was previously done in a separate lock/unlock block for each lb member, up into the lb itself, which uses a single lock/unlock block. That doesn't reduce the total time the lock is held, but it does reduce the number of lock/unlock operations.

- using a different mechanism to check whether the global maintenance should actually run (based on atomics and a global decision instead of a per-lb decision). That should be more efficient and again reduces lock/unlock operations, because the check can now be done without the lock; locking is only needed when the global maintenance actually has to run. It does introduce atomics though.
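To illustrate that second point, here is a rough standalone sketch of the idea, written with plain C11 atomics just for readability (the real patch uses the JK/APR primitives and keeps the value in the shared memory segment, so it works across processes and not only threads):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Next point in time (seconds) at which global maintenance is due. */
    static atomic_long next_maintenance;
    #define MAINT_INTERVAL 10          /* corresponds to worker.maintain */

    /* Returns true for at most one caller per interval. All other callers
     * take the fast path and never touch the lock for this decision. */
    static bool maintenance_due(long now)
    {
        long due = atomic_load(&next_maintenance);
        if (now < due)
            return false;              /* common case: no lock, no CAS */
        /* One caller wins the CAS and runs the global maintenance (under
         * the lock); the losers simply skip it. */
        return atomic_compare_exchange_strong(&next_maintenance, &due,
                                              now + MAINT_INTERVAL);
    }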

You would have to rebuild mod_jk completely, not only the changed files, because of header file changes. So do a "make clean" before the "make".
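Roughly like this (the source and apxs paths are only examples; adjust the -p level to however the patch applies):

    cd tomcat-connectors-1.2.40-src/native
    patch -p0 < /path/to/jk_shm_lock.patch
    ./configure --with-apxs=/usr/sbin/apxs
    make clean && make
    # then install the freshly built mod_jk.so (under apache-2.0/ for httpd 2.x
    # builds) and restart httpd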

There are no configuration changes needed. You could run one of your Apaches with the patched mod_jk and compare behavior. Please give it a try on a test system first. I haven't done thorough testing myself yet.

I plan to include this in the next version (1.2.41), which is not far away. Testing it and reporting back whether it helps would be very welcome.

Regards and thanks

Rainer

