On 24.05.2010 23:36, LES wrote:

I am having some trouble keeping a mod_jk setup stable.  At this point, I
feel like I am too far into trial and error mode and would like some help
figuring out how to identify the problem.

My current setup involves two Linux (RHEL 5) servers, each running two
Tomcat instances (6.0.20).  A third RHEL 5 box is running Apache httpd
(2.2.3) with mod_jk (1.2.28).  I am using Terracotta to "cluster" the
Tomcat sessions.

The problem that I am having is that under small load (and unfortunately,
intermittently), I get random nodes that produce errors.  Typically these
errors indicate that mod_jk can no longer contact tomcat (see excerpts
below).  In most cases, the user request just hangs (never returns).
So, it also appears that the errors are not causing a session failover --
though I need to confirm that again after my recent round of changes.  In
most cases, these nodes that are in error recover on their own.  However,
during the failure event, I get a bunch of unhappy users.  I am hoping to
find a way to make the nodes more stable and then address the fail-over
aspect.

I have tried different mod_jk parameters and think I have settled on a
decent set of them.  I am logging all of the garbage collection
information and do not see any gc events that take longer than the
request timeout.  I am gathering JVM and OS stats and do not see a
hardware constraint (memory, cpu, io).  So, I am at a bit of a loss on
where to look.

I am pasting in all of the relevant files/excerpts that I can think of.  I
appreciate any advice on what additional data to gather to shed light on
this problem (outright solutions are welcome too :)).

Please let me know if there is any other information that would be helpful.

Thanx,
LES


************* workers.properties **************
# Workers available for mapping requests
worker.list=lb,jkstatus,cas
# Template properties shared by the tomcat workers (via 'reference')
worker.template.type=ajp13
worker.template.retries=4
worker.template.lbfactor=1
worker.template.reply_timeout=300000
worker.template.max_reply_timeouts=4
worker.template.connection_pool_timeout=60
worker.template.ping_mode=A
#worker.template.socket_timeout=10

This is in milliseconds, I guess you want 10000:

worker.template.socket_connect_timeout=10
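
In other words, assuming a 10 second connect timeout is what you want:

worker.template.socket_connect_timeout=10000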

worker.tomcat01-instance1.reference=worker.template
worker.tomcat01-instance1.host=tomcat01.barnhardt.local
worker.tomcat01-instance1.port=8009

worker.tomcat01-instance2.reference=worker.template
worker.tomcat01-instance2.host=tomcat01.barnhardt.local
worker.tomcat01-instance2.port=18009

worker.tomcat02-instance1.reference=worker.template
worker.tomcat02-instance1.host=tomcat02.barnhardt.local
worker.tomcat02-instance1.port=8009

worker.tomcat02-instance2.reference=worker.template
worker.tomcat02-instance2.host=tomcat02.barnhardt.local
worker.tomcat02-instance2.port=18009

worker.cas.type=ajp13
worker.cas.host=localhost
worker.cas.port=8009
worker.cas.lbfactor=1
worker.cas.connection_pool_timeout=600
worker.cas.socket_keepalive=1

I don't like the raw socket_timeout, but well ...

worker.cas.socket_timeout=60
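
If you want to drop it, the cping/cpong probe is the usual replacement.
A sketch, assuming mod_jk 1.2.27 or later (you are on 1.2.28):

worker.cas.ping_mode=A
# probe timeout in milliseconds
worker.cas.ping_timeout=10000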

# Set properties for lb which use the other workers
worker.lb.type=lb
#worker.lb.method=B
worker.lb.sticky_session=True
worker.lb.balance_workers=tomcat01-instance1,tomcat01-instance2,tomcat02-instance1,tomcat02-instance2

# Define a 'jkstatus' worker using status
worker.jkstatus.type=status
***********************************************


****** Errors from log *******

//////This particular error (info) seems to happen constantly - is it a
normal operational thing?

Yes, it is not an error, it is an "info" message.  It simply says that
all connections from your apache process to tomcat were closed and a
fresh one had to be opened.

[Mon May 24 10:22:56 2010] [26131:4045374208] [info]
ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance2) all endpoints
are disconnected, detected by connect check (1), cping (0), send (0)
[Mon May 24 11:55:21 2010] [2711:4045374208] [info]
ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance1) all endpoints
are disconnected, detected by connect check (1), cping (0), send (0)
[Mon May 24 13:08:25 2010] [27439:4045374208] [info]
ajp_send_request::jk_ajp_common.c (1496): (tomcat01-instance1) all endpoints
are disconnected, detected by connect check (1), cping (0), send (0)
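
If you want both sides to agree on when idle AJP connections get closed,
you can set a matching connectionTimeout (note: milliseconds) on the
Tomcat AJP connector.  A minimal sketch, assuming your connectors look
like the default server.xml entry:

<Connector port="8009" protocol="AJP/1.3"
           connectionTimeout="60000" />

That would pair with your connection_pool_timeout=60 (seconds) on the
mod_jk side.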

So I'd say something gets stuck in your tomcat (likely: your webapp) and
mod_jk detects that by use of the reply timeout.  Since you have a 5
minute reply timeout, chances are good to find those requests and the
cause of their hanging or excessively long response time by use of

- a tomcat access log with an improved pattern containing "%D" and, if
  your Tomcat is recent enough, also "%I"
- and regular thread dumps (examples below)
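
For the access log, something like the following in server.xml would do
(a sketch; adjust directory/prefix to taste; %D is the request processing
time in milliseconds, %I the name of the thread that handled it):

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="access." suffix=".log"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D %I" />

For the thread dumps, send the JVM a QUIT signal or use jstack:

kill -3 <tomcat-pid>                          # dump lands in catalina.out
jstack <tomcat-pid> > dump-$(date +%H%M%S).txt

A request with a huge %D in the access log can then be matched, via its
%I thread name, against the stacks in the dumps taken while it hung.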

////This error happens intermittently and seems to cause some of the
cluster problems I mentioned above
[Mon May 24 07:19:21 2010] [27432:4045374208] [error]
ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with
waiting reply from tomcat. Tomcat is down, stopped or network problems
(errno=110)
[Mon May 24 07:19:23 2010] [27432:4045374208] [info]
ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request to
tomcat failed (recoverable), because of reply timeout (attempt=1)
[Mon May 24 07:24:23 2010] [27432:4045374208] [error]
ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with
waiting reply from tomcat. Tomcat is down, stopped or network problems
(errno=110)
[Mon May 24 07:24:25 2010] [27432:4045374208] [info]
ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request to
tomcat failed (recoverable), because of reply timeout (attempt=2)

I guess the next one is due to the socket_connect_timeout being set to 10
milliseconds instead of 10 seconds:

////I get this error occasionally, too
[Sun May 23 03:48:51 2010] [15814:4045374208] [info]
jk_open_socket::jk_connect.c (594): connect to 192.168.60.157:8009 failed
(errno=115)
[Sun May 23 03:48:51 2010] [15814:4045374208] [info]
ajp_connect_to_endpoint::jk_ajp_common.c (922): Failed opening socket to
(192.168.60.157:8009) (errno=115)
[Sun May 23 03:48:51 2010] [15814:4045374208] [error]
ajp_send_request::jk_ajp_common.c (1507): (tomcat02-instance1) connecting to
backend failed. Tomcat is probably not started or is listening on the wrong
port (errno=115)

Error number 104 (errno=104) is "Connection reset by peer" on RHEL 5:

////Third time is a charm...another error for the hat trick
[Sat May 22 21:41:17 2010] [13933:4045374208] [info]
ajp_connection_tcp_get_message::jk_ajp_common.c (1150): (tomcat01-instance1)
can't receive the response header message from tomcat, network problems or
tomcat (192.168.60.156:8009) is down (errno=104)
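
To see whether the resets correlate with the connector getting
overloaded, it can help to watch the AJP port on the tomcat box while
the problem happens, e.g.:

netstat -tan | grep 8009

Lots of connections stuck in CLOSE_WAIT or SYN_RECV there would point at
the connector (for example an exhausted maxThreads) rather than at the
network.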

Regards,

Rainer
