----------------------------------------------------------------
BEFORE YOU POST, search the faq at <http://java.apache.org/faq/>
WHEN YOU POST, include all relevant version numbers, log files,
and configuration files.  Don't make us guess your problem!!!
----------------------------------------------------------------


I'm looking at failover issues in our Apache JServ deployments.

We keep our JServ instances on separate machines from our web servers,
and start the Apache and JServ instances independently.  We use Apache
1.3.12 with JServ 1.1.2.

There are two primary ways a jserv instance can die:

1. the process can be killed (usually explicitly) or otherwise die

2. the machine the instance is on can power down or lose network
   connectivity.

The first case is no problem: mod_jserv gets an immediate
"connection refused" and moves on to the next JServ in the balance
list. The second case is more of a problem. Nothing on the dead
machine answers the connection attempt, so mod_jserv has to wait for
the TCP connect timeout, which is unacceptably long (typically a
minute or more).
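To make the difference between the two cases concrete, here is a small sketch in Java (chosen because JServ itself is Java; mod_jserv is C, so this is not its code). It bounds the connect attempt with `java.net.Socket.connect(SocketAddress, int)` instead of relying on the operating system's default connect timeout. Typically a killed process shows up as a `ConnectException` right away, while a powered-off or unreachable host shows up as a `SocketTimeoutException` once the bound expires. The class name `ConnectProbe` and the default port 8007 are assumptions for illustration; adjust the port to whatever your JServ instances listen on.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical probe class for illustration; not part of JServ or mod_jserv.
public class ConnectProbe {

    public enum Result { UP, REFUSED, TIMED_OUT, OTHER_ERROR }

    // Attempts a TCP connect with an explicit upper bound instead of
    // waiting for the operating system's default connect timeout.
    public static Result probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.UP;          // handshake completed
        } catch (ConnectException e) {
            return Result.REFUSED;     // typically a reset: host is up, nothing listening (case 1)
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;   // no answer within the bound: host down or unreachable (case 2)
        } catch (IOException e) {
            return Result.OTHER_ERROR; // e.g. no route to host, unresolvable name
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8007; // assumed JServ port
        long start = System.nanoTime();
        Result r = probe(host, port, 5000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + ":" + port + " -> " + r + " after " + elapsedMs + " ms");
    }
}
```

Pointed at a machine that has been unplugged, this should report `TIMED_OUT` after about 5 seconds rather than hanging for the full TCP timeout.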

I understand that this is a feature of TCP/IP, and I'm not suggesting
this is a bug in JServ.  But we need to deal with it somehow, mostly
for the case where a machine goes bad and has to be taken offline (or
crashes hard).

We *could* do this:

1. detect that an application server is dead (through the SNMP
   monitoring we already have in place, or some other means);

2. regenerate the load-balancing config on Apache, and restart Apache.

The problem with this is that it requires human intervention, and
some customers will still be left hanging on requests to the dead
app server instances in the meantime.

It would be better if the watchdog process could take a somewhat
aggressive approach to timeouts, and remove a server from the balance
group if it didn't respond within some configurable amount of time.
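A minimal sketch of that idea, assuming a watchdog that treats a bounded TCP connect as its "ping" and keeps a live subset of the balance group that a dispatcher would balance across. `BalanceGroupWatchdog`, `Backend`, `checkOnce`, and `failuresBeforeRemoval` are hypothetical names, not JServ classes or settings; JServ's real watchdog is structured differently. A host is dropped only after a configurable number of consecutive failed pings, and it rejoins as soon as a ping succeeds again.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog for illustration only; not JServ's watchdog.
public class BalanceGroupWatchdog {

    public static final class Backend {
        public final String host;
        public final int port;
        public Backend(String host, int port) { this.host = host; this.port = port; }
        @Override public String toString() { return host + ":" + port; }
    }

    private final List<Backend> configured;     // every backend in the balance group
    private final int[] consecutiveFailures;    // indexed like 'configured'
    private final int timeoutMillis;            // per-ping connect bound, e.g. 5000
    private final int failuresBeforeRemoval;    // tolerate short blips before dropping a host
    private volatile List<Backend> live;        // subset a dispatcher would balance across

    public BalanceGroupWatchdog(List<Backend> backends, int timeoutMillis, int failuresBeforeRemoval) {
        this.configured = new ArrayList<>(backends);
        this.consecutiveFailures = new int[configured.size()];
        this.timeoutMillis = timeoutMillis;
        this.failuresBeforeRemoval = failuresBeforeRemoval;
        this.live = Collections.unmodifiableList(new ArrayList<>(configured));
    }

    // The "ping" is just a TCP connect bounded by timeoutMillis.
    private boolean ping(Backend b) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(b.host, b.port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, unreachable: all count as a failure
        }
    }

    // One watchdog pass: probe every backend and rebuild the live list.
    public synchronized void checkOnce() {
        List<Backend> next = new ArrayList<>();
        for (int i = 0; i < configured.size(); i++) {
            Backend b = configured.get(i);
            if (ping(b)) {
                consecutiveFailures[i] = 0;            // healthy again: rejoin immediately
            } else if (consecutiveFailures[i] < failuresBeforeRemoval) {
                consecutiveFailures[i]++;              // capped so the counter cannot overflow
            }
            if (consecutiveFailures[i] < failuresBeforeRemoval) {
                next.add(b);
            }
        }
        live = Collections.unmodifiableList(next);
    }

    public List<Backend> liveBackends() { return live; }

    // Run checkOnce() repeatedly on a caller-supplied scheduler.
    public void start(ScheduledExecutorService scheduler, long periodSeconds) {
        scheduler.scheduleWithFixedDelay(this::checkOnce, 0, periodSeconds, TimeUnit.SECONDS);
    }
}
```

Usage would be along the lines of constructing it with a 5000 ms timeout and `failuresBeforeRemoval` of 2, then calling `start(Executors.newSingleThreadScheduledExecutor(), 5)`. Because probes run one after another, a single pass can take up to N × `timeoutMillis` when several hosts are dead at once, and an empty live list still needs some fallback. How the live list would get back into mod_jserv is left out, since that depends on mod_jserv internals.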

I've been looking through the mod_jserv code, and a lot of it uses
the Apache API timeout facilities, but those seem to work at a higher
level: they won't help you get to the next living machine in the
balance group.  Am I understanding this right?

Has anyone else come up with a solution to this issue?

Would it be insane or wrong of me to try to put something into the
watchdog so that if a JServ didn't respond to a ping within 5 seconds,
it was removed from the balance group?  (Or, if not 5, some small
configurable number of seconds, much less than 60.)

Thanks for any advice/comments.

billo


--
--------------------------------------------------------------
To subscribe:        [EMAIL PROTECTED]
To unsubscribe:      [EMAIL PROTECTED]
Search Archives: 
<http://www.mail-archive.com/java-apache-users%40list.working-dogs.com/>
Problems?:           [EMAIL PROTECTED]
