Jess Holle wrote:
proxy_handler() calls ap_proxy_pre_request() inside a do loop over balanced workers.

This in turn calls proxy_balancer_pre_request() which does

    (*worker)->s->busy++;

Correspondingly proxy_balancer_post_request() does:

        if (worker && worker->s->busy)
            worker->s->busy--;

Unfortunately, proxy_handler() calls proxy_run_post_request(), and thus proxy_balancer_post_request(), only once, outside the do loop. Thus the "busy" count of workers which currently cannot take requests (e.g. that are currently dead) increases without bound due to retries -- and is never reset.
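To make the leak concrete, here is a minimal standalone sketch of the pattern described above. It is a simplified model, not the actual mod_proxy code: the sim_* types and functions and the release_on_retry flag are hypothetical names invented for illustration, and the loop only approximates what proxy_handler() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the shared worker state (not the real structs). */
typedef struct { int busy; } sim_shared;
typedef struct { sim_shared *s; int alive; } sim_worker;

/* Mirrors the quoted pre-request step: mark the worker busy. */
static void sim_pre_request(sim_worker *w)
{
    w->s->busy++;
}

/* Mirrors the quoted post-request step: release the worker. */
static void sim_post_request(sim_worker *w)
{
    if (w && w->s->busy)
        w->s->busy--;
}

/*
 * Simulated handler: try workers in order until one is alive.
 * release_on_retry == 0: post_request runs once, after the loop, for the
 *   last worker tried only (the behavior described above).
 * release_on_retry == 1: each failed attempt is released before moving on
 *   (one possible fix, assumed here for comparison).
 * Returns the index of the worker that served the request, or -1.
 */
static int sim_handle(sim_worker *workers, size_t n, int release_on_retry)
{
    sim_worker *w = NULL;
    int served = -1;
    size_t i;

    assert(workers != NULL);
    for (i = 0; i < n; i++) {
        w = &workers[i];
        sim_pre_request(w);
        if (w->alive) {
            served = (int)i;
            break;
        }
        if (release_on_retry)
            sim_post_request(w);
    }
    sim_post_request(w); /* only the last worker tried is released here */
    return served;
}
```

With one dead worker tried first and one live worker behind it, 100 calls to sim_handle(workers, 2, 0) leave the dead worker's busy at 100, while the same 100 calls with release_on_retry set to 1 leave it at 0.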

Does anyone (i.e. who is more familiar with this code) have suggestions for how this should be fixed? If not, I can take a swing at it.

Similarly, when retrying workers in various routines in mod_proxy_balancer.c those workers' lbstatus is incremented. If the retry fails, however, the lbstatus is never reset. This issue also leads to an lbstatus that increases without bound. Just because a worker was dead for 8 hours does not mean it can handle all the workload now. It needs to start fresh -- not 8 hours in the hole. This issue also creates an unduly huge impact when doing

    mycandidate->s->lbstatus -= total_factor;

Actually I'm off base here. total_factor places undue emphasis on any worker that satisfies a request when multiple dead workers are retried. For instance, if there are 7 dead workers, all being retried, 2 healthy workers, and all with an lbfactor of 1, the worker that gets the request gets its lbstatus decremented by 9, whereas it really should only be decremented by 2 -- else the weighting gets thrown way off. However, it is /not/ thrown off more due to the huge lbstatus values that build up in dead workers. That only becomes an issue when dead workers come to life.
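The 9-versus-2 arithmetic can be checked with a small standalone model of one selection round. This is a sketch under assumptions, not the real balancer code: sim_lb_worker, sim_pick and the count_dead flag are hypothetical, and dead workers are modeled as being retried (so they contribute to total_factor) but never ending up as the worker that serves the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified worker record (not the real mod_proxy struct). */
typedef struct {
    int lbfactor;
    int lbstatus;
    int healthy;
} sim_lb_worker;

/*
 * One selection round over n workers.
 * count_dead == 1: dead workers being retried also get lbstatus bumped and
 *   add their lbfactor to total_factor (the behavior described above).
 * count_dead == 0: only healthy workers take part (assumed alternative).
 * The healthy worker with the highest lbstatus wins (ties go to the lowest
 * index) and has total_factor subtracted from its lbstatus.
 * Returns the index of the chosen worker, or -1 if none is healthy.
 */
static int sim_pick(sim_lb_worker *w, size_t n, int count_dead)
{
    int total_factor = 0;
    int best = -1;
    size_t i;

    assert(w != NULL);
    for (i = 0; i < n; i++) {
        if (!w[i].healthy && !count_dead)
            continue;
        w[i].lbstatus += w[i].lbfactor;
        total_factor += w[i].lbfactor;
        if (w[i].healthy && (best < 0 || w[i].lbstatus > w[best].lbstatus))
            best = (int)i;
    }
    if (best >= 0)
        w[best].lbstatus -= total_factor;
    return best;
}
```

Starting from lbstatus 0 everywhere with 7 dead and 2 healthy workers, all with lbfactor 1, one round with count_dead set to 1 leaves the chosen worker at 1 - 9 = -8, whereas with count_dead set to 0 it ends at 1 - 2 = -1.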

We're seeing the load balancing be thrown dramatically off in this case.

Does anyone have suggestions for how this should be fixed? If not, again I can take a swing at this, e.g. resetting lbstatus to 0 in ap_proxy_retry_worker().

It *seems* like both of the issues center on handling of dead workers, especially having multiple dead workers and/or workers that are dead for long periods of time.

I've not yet checked whether mod_jk (where I believe these basic algorithms came from) has similar issues.

--
Jess Holle

