On 24.05.2010 23:36, LES wrote:
I am having some trouble keeping a mod_jk setup stable. At this point, I feel like I am too far into trial-and-error mode and would like some help figuring out how to identify the problem.

My current setup involves two Linux (RHEL 5) servers, each running two Tomcat instances (6.0.20). A third RHEL 5 box is running Apache (2.2.3) with mod_jk (1.2.28). I am using Terracotta to "cluster" the Tomcat sessions.

The problem is that under small load (and, unfortunately, intermittently), random nodes produce errors. Typically these errors indicate that mod_jk can no longer contact Tomcat (see excerpts below). In most cases, the user request just hangs (never returns). So it also appears that the errors are not causing a session failover, though I need to confirm that again after my recent round of changes. In most cases, the nodes that are in error recover on their own. However, during the failure event, I get a bunch of unhappy users. I am hoping to find a way to make the nodes more stable and then address the failover aspect.

I have tried different mod_jk parameters and think I have settled on a decent set of them. I have all of the garbage collection information logged and do not seem to have any GC events that take longer than the request timeout. I am gathering JVM and OS stats and do not see a hardware constraint (memory, CPU, I/O). So I am a bit at a loss on where to look.

I am pasting in all of the relevant files/excerpts that I can think of. I appreciate any advice on what additional data to gather to shed light on this problem (outright solutions are welcome too :)). Please let me know if there is any other information that would be helpful.
Thanx, LES

************* workers.properties **************
# Define 1 real worker using ajp13
worker.list=lb,jkstatus,cas

# Set properties for worker1 (ajp13)
worker.template.type=ajp13
worker.template.retries=4
worker.template.lbfactor=1
worker.template.reply_timeout=300000
worker.template.max_reply_timeouts=4
worker.template.connection_pool_timeout=60
worker.template.ping_mode=A
#worker.template.socket_timeout=10
This is in milliseconds, I guess you want 10000:
worker.template.socket_connect_timeout=10

worker.tomcat01-instance1.reference=worker.template
worker.tomcat01-instance1.host=tomcat01.barnhardt.local
worker.tomcat01-instance1.port=8009

worker.tomcat01-instance2.reference=worker.template
worker.tomcat01-instance2.host=tomcat01.barnhardt.local
worker.tomcat01-instance2.port=18009

worker.tomcat02-instance1.reference=worker.template
worker.tomcat02-instance1.host=tomcat02.barnhardt.local
worker.tomcat02-instance1.port=8009

worker.tomcat02-instance2.reference=worker.template
worker.tomcat02-instance2.host=tomcat02.barnhardt.local
worker.tomcat02-instance2.port=18009

worker.cas.type=ajp13
worker.cas.host=localhost
worker.cas.port=8009
worker.cas.lbfactor=1
worker.cas.connection_pool_timeout=600
worker.cas.socket_keepalive=1
I don't like the raw socket_timeout, but well ...
worker.cas.socket_timeout=60

# Set properties for lb which use the other workers
worker.lb.type=lb
#worker.lb.method=B
worker.lb.sticky_session=True
worker.lb.balance_workers=tomcat01-instance1,tomcat01-instance2,tomcat02-instance1,tomcat02-instance2

# Define a 'jkstatus' worker using status
worker.jkstatus.type=status
***********************************************

****** Errors from log *******
//////This particular error(info) seems to happen constantly - is it a normal operational thing?
Yes, it is not an error, it is an "info" message. It simply says that all connections from your Apache process to Tomcat were closed and a fresh one had to be opened.
[Mon May 24 10:22:56 2010] [26131:4045374208] [info] ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance2) all endpoints are disconnected, detected by connect check (1), cping (0), send (0)
[Mon May 24 11:55:21 2010] [2711:4045374208] [info] ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance1) all endpoints are disconnected, detected by connect check (1), cping (0), send (0)
[Mon May 24 13:08:25 2010] [27439:4045374208] [info] ajp_send_request::jk_ajp_common.c (1496): (tomcat01-instance1) all endpoints are disconnected, detected by connect check (1), cping (0), send (0)
So I'd say something gets stuck in your Tomcat (likely: your webapp) and mod_jk detects that by use of the reply timeout. Since you have a 5 minute reply timeout, chances are good that you can find those requests, and the cause of their hanging or excessively long response time, by use of
- a Tomcat access log with an improved pattern containing "%D" (request processing time in milliseconds) and, if your Tomcat is recent enough, also "%I" (name of the request thread)
- and regular thread dumps
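Once the access log carries "%D" and "%I", the slow requests can be pulled out with a small script. What follows is only a minimal sketch in Python. It assumes a hypothetical pattern that ends in `%D %I` (for example `%h %l %u %t "%r" %s %b %D %I`), with %D in milliseconds as in Tomcat 6. The function name, the 60 second threshold and the sample lines are made up for illustration.

```python
def slow_requests(lines, threshold_ms=60000):
    """Return (elapsed_ms, thread_name, line) for requests at or above threshold_ms.

    Assumes each access log line ends with "<%D> <%I>", i.e. the elapsed
    time in milliseconds followed by the request thread name.
    """
    hits = []
    for line in lines:
        parts = line.rstrip().rsplit(None, 2)
        if len(parts) != 3:
            continue  # too few fields to be an access log line
        _, elapsed, thread = parts
        if not elapsed.isdigit():
            continue  # line does not match the assumed pattern
        elapsed_ms = int(elapsed)
        if elapsed_ms >= threshold_ms:
            hits.append((elapsed_ms, thread, line.rstrip()))
    # Slowest requests first
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Made-up sample lines, not taken from the original poster's logs.
sample = [
    '192.168.60.10 - - [24/May/2010:07:14:21 +0200] "GET /app/report HTTP/1.1" 200 5120 301234 ajp-8009-3',
    '192.168.60.10 - - [24/May/2010:07:14:25 +0200] "GET /app/index HTTP/1.1" 200 812 42 ajp-8009-7',
]
for elapsed_ms, thread, _ in slow_requests(sample):
    print(elapsed_ms, thread)
```

Keep in mind that an access log line is only written when the request completes, so a request that is still hanging will not show up yet. That is where the thread dumps come in: the thread name from "%I" can be looked up in a dump taken while the request was running.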
////This error happens intermittently and seems to cause some of the cluster problems I mentioned above
[Mon May 24 07:19:21 2010] [27432:4045374208] [error] ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
[Mon May 24 07:19:23 2010] [27432:4045374208] [info] ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request to tomcat failed (recoverable), because of reply timeout (attempt=1)
[Mon May 24 07:24:23 2010] [27432:4045374208] [error] ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
[Mon May 24 07:24:25 2010] [27432:4045374208] [info] ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request to tomcat failed (recoverable), because of reply timeout (attempt=2)
I guess the next one is due to the socket_connect_timeout set to 10 milliseconds instead of 10 seconds:
////I get this error occasionally, too
[Sun May 23 03:48:51 2010] [15814:4045374208] [info] jk_open_socket::jk_connect.c (594): connect to 192.168.60.157:8009 failed (errno=115)
[Sun May 23 03:48:51 2010] [15814:4045374208] [info] ajp_connect_to_endpoint::jk_ajp_common.c (922): Failed opening socket to (192.168.60.157:8009) (errno=115)
[Sun May 23 03:48:51 2010] [15814:4045374208] [error] ajp_send_request::jk_ajp_common.c (1507): (tomcat02-instance1) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=115)
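The 10 ms versus 10 s mix-up is easy to make because the mod_jk timeouts do not share one unit: socket_connect_timeout and reply_timeout are given in milliseconds, while socket_timeout and connection_pool_timeout are given in seconds. Here is a rough Python sketch of a check that flags millisecond-valued timeouts which look as if they were written in seconds. The helper name and the 1000 ms cut-off are assumptions for illustration, not anything mod_jk provides.

```python
# Directives that mod_jk interprets in milliseconds.
MS_DIRECTIVES = {
    "socket_connect_timeout",
    "connect_timeout",
    "prepost_timeout",
    "reply_timeout",
    "ping_timeout",
}


def suspicious_timeouts(text, min_ms=1000):
    """Return (worker, directive, value) for millisecond timeouts below min_ms."""
    found = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and non-assignments
        key, value = (part.strip() for part in line.split("=", 1))
        fields = key.split(".")
        # Expect keys of the form worker.<name>.<directive>
        if len(fields) != 3 or fields[0] != "worker":
            continue
        _, worker, directive = fields
        if directive in MS_DIRECTIVES and value.isdigit():
            # A value of 0 is treated as "not set" here and is not flagged.
            if 0 < int(value) < min_ms:
                found.append((worker, directive, int(value)))
    return found


config = """
worker.template.reply_timeout=300000
worker.template.connection_pool_timeout=60
worker.template.socket_connect_timeout=10
"""
print(suspicious_timeouts(config))
# prints [('template', 'socket_connect_timeout', 10)]
```

A small value is not necessarily wrong, so the result is only a list of lines worth a second look.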
Error number 104 (errno=104) is "Connection reset by peer" on RHEL 5:
////Third time is a charm...another error for the hat trick
[Sat May 22 21:41:17 2010] [13933:4045374208] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1150): (tomcat01-instance1) can't receive the response header message from tomcat, network problems or tomcat (192.168.60.156:8009) is down (errno=104)
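The errno values in the mod_jk log lines are plain operating system error numbers. Below is a small Python sketch that pulls them out of log lines and names them. The table covers only the three values seen in this thread, with the numbers as defined on Linux (they differ on other platforms), and the function name is made up.

```python
import re

# Linux errno values seen in the log excerpts of this thread.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
    115: ("EINPROGRESS", "Operation now in progress"),
}

ERRNO_RE = re.compile(r"\(errno=(\d+)\)")


def explain_errno(log_line):
    """Return (number, name, message) for the errno in a mod_jk log line, or None."""
    match = ERRNO_RE.search(log_line)
    if match is None:
        return None
    number = int(match.group(1))
    name, message = LINUX_ERRNO.get(number, ("UNKNOWN", "not in the table above"))
    return number, name, message


line = (
    "[Sun May 23 03:48:51 2010] [15814:4045374208] [info] "
    "jk_open_socket::jk_connect.c (594): connect to 192.168.60.157:8009 "
    "failed (errno=115)"
)
print(explain_errno(line))
# prints (115, 'EINPROGRESS', 'Operation now in progress')
```

EINPROGRESS is what a non-blocking connect reports while the connection attempt is still pending, which would be consistent with a connect timeout that expires after only 10 ms.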
Regards, Rainer