Jess Holle wrote:
proxy_handler() calls ap_proxy_pre_request() inside a do loop over
balanced workers.
This in turn calls proxy_balancer_pre_request() which does
(*worker)->s->busy++.
Correspondingly proxy_balancer_post_request() does:
    if (worker && worker->s->busy)
        worker->s->busy--;
Unfortunately, proxy_handler() calls proxy_run_post_request(), and
thus proxy_balancer_post_request(), only once, outside the do loop.
Thus the "busy" count of workers which currently cannot take requests
(e.g. workers that are currently dead) increases without bound due to
retries -- and is never reset.
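
To make the leak concrete, here is a minimal stand-alone C sketch. It
is not httpd code: sim_worker, sim_pre_request(), sim_post_request()
and sim_handle_request() are hypothetical stand-ins that only model
the busy counter as described above, and the sketch assumes the dead
worker is tried first on every request. The release_on_failure flag
shows one possible fix (releasing each failed attempt inside the
loop), not necessarily the right one for mod_proxy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared worker state; only the fields
 * needed for this illustration are modelled. */
typedef struct {
    int busy;   /* requests currently attributed to this worker */
    int alive;  /* 0 = dead, 1 = able to serve */
} sim_worker;

/* Models what proxy_balancer_pre_request() is described as doing. */
static void sim_pre_request(sim_worker *w)
{
    w->busy++;
}

/* Models the quoted proxy_balancer_post_request() logic. */
static void sim_post_request(sim_worker *w)
{
    if (w && w->busy)
        w->busy--;
}

/* Handle one request by trying workers in order until one is alive.
 * release_on_failure == 0: post_request runs once, after the loop,
 *   for the last worker tried (the behaviour described above).
 * release_on_failure != 0: every failed attempt is released inside
 *   the loop (one possible fix).
 * Returns the index of the worker that served, or -1 if none did. */
static int sim_handle_request(sim_worker *workers, size_t n,
                              int release_on_failure)
{
    sim_worker *last = NULL;
    int served = -1;

    assert(workers != NULL);
    for (size_t i = 0; i < n; i++) {
        last = &workers[i];
        sim_pre_request(last);
        if (last->alive) {
            served = (int)i;
            break;
        }
        if (release_on_failure)
            sim_post_request(last);
    }
    /* Avoid releasing the last worker twice when it already failed
     * and was released inside the loop. */
    if (last != NULL && (served >= 0 || !release_on_failure))
        sim_post_request(last);
    return served;
}
```

With one dead worker followed by one live worker, 100 calls with
release_on_failure set to 0 leave the dead worker's busy count at 100
and the live worker's at 0; with the flag set to 1 both stay at 0.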
Does anyone (i.e. someone more familiar with this code) have
suggestions for how this should be fixed? If not, I can take a swing
at it.
Similarly, when retrying workers in various routines in
mod_proxy_balancer.c, those workers' lbstatus is incremented. If the
retry fails, however, the lbstatus is never reset. This issue also
leads to an lbstatus that increases without bound. Just because a
worker was dead for 8 hours does not mean it can handle the entire
workload now. It needs to start fresh -- not 8 hours in the hole. This
issue also creates an unduly huge impact when doing
    mycandidate->s->lbstatus -= total_factor;
Actually, I'm off base here. total_factor places undue emphasis on any
worker that satisfies a request when multiple dead workers are retried.
For instance, if there are 7 dead workers, all being retried, and 2
healthy workers, all with an lbfactor of 1, the worker that gets the
request gets its lbstatus decremented by 9, whereas it really should
only be decremented by 2 -- else the weighting gets thrown way off.
However, it
is /not/ thrown off more due to the huge lbstatus values that build up
in dead workers. That only becomes an issue when dead workers come to life.
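
The arithmetic in that example can be checked with a small stand-alone
C sketch. It is not the mod_proxy_balancer code: tf_worker,
tf_total_factor() and tf_example() are hypothetical names, and the
sketch assumes that total_factor is simply the sum of lbfactor over
every worker taking part in a selection pass, with retried dead
workers either included or excluded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a balancer member; not the httpd struct. */
typedef struct {
    int lbfactor;  /* configured weight */
    int healthy;   /* 1 = serving, 0 = dead but being retried */
} tf_worker;

/* Sum the lbfactor of every worker that takes part in one selection
 * pass. With count_retried != 0, dead workers that are being retried
 * are included (the behaviour described above); with 0, only healthy
 * workers contribute. */
static int tf_total_factor(const tf_worker *w, size_t n,
                           int count_retried)
{
    int total = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        if (w[i].healthy || count_retried)
            total += w[i].lbfactor;
    }
    return total;
}

/* Build the example from the text: 7 dead workers and 2 healthy
 * workers, all with an lbfactor of 1. */
static int tf_example(int count_retried)
{
    tf_worker w[9];

    for (size_t i = 0; i < 9; i++) {
        w[i].lbfactor = 1;
        w[i].healthy = (i >= 7);  /* the last two are healthy */
    }
    return tf_total_factor(w, 9, count_retried);
}
```

tf_example(1) gives the decrement of 9 applied today in this scenario,
and tf_example(0) gives the decrement of 2 that only counts workers
actually able to take the request.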
We're seeing load balancing thrown off dramatically in this case.
Does anyone have suggestions for how this should be fixed? If not,
again I can take a swing at this, e.g. resetting lbstatus to 0 in
ap_proxy_retry_worker().
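
As a rough illustration of why the reset might help, here is a
stand-alone C sketch of a simplified by-requests selection between one
healthy worker and one revived worker. It is not the httpd
implementation: rv_worker, rv_pick() and rv_revived_streak() are
hypothetical, and the sketch assumes each pass adds lbfactor to every
usable worker, picks the highest lbstatus (ties go to the first
worker), and subtracts the summed lbfactor from the winner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a balancer member; not the httpd struct. */
typedef struct {
    int lbfactor;  /* configured weight */
    int lbstatus;  /* running balance used to pick the next worker */
} rv_worker;

/* One simplified selection pass over usable workers. Returns the
 * index of the worker chosen for this request. */
static size_t rv_pick(rv_worker *w, size_t n)
{
    int total_factor = 0;
    size_t best = 0;

    assert(w != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        w[i].lbstatus += w[i].lbfactor;
        total_factor += w[i].lbfactor;
        if (w[i].lbstatus > w[best].lbstatus)
            best = i;
    }
    w[best].lbstatus -= total_factor;
    return best;
}

/* Count how many consecutive requests go to a revived worker that
 * accumulated `backlog` in lbstatus while it was dead, before the
 * healthy worker is chosen again. If reset_on_revive is non-zero,
 * lbstatus is set to 0 on revival, as proposed above. */
static int rv_revived_streak(int backlog, int reset_on_revive)
{
    rv_worker w[2] = {
        { 1, 0 },                              /* healthy worker */
        { 1, reset_on_revive ? 0 : backlog }   /* revived worker */
    };
    int streak = 0;

    while (rv_pick(w, 2) == 1)
        streak++;
    return streak;
}
```

Under these assumptions a backlog of 1000 sends 500 consecutive
requests to the revived worker before the healthy one is picked again,
while resetting lbstatus on revival brings the streak down to 0.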
It *seems* like both of these issues center on the handling of dead
workers, especially having multiple dead workers and/or workers that
are dead for long periods of time.
I've not yet checked whether mod_jk (where I believe these basic
algorithms came from) has similar issues.
--
Jess Holle