I am currently hunting down an issue where a balancer member that was put into 
error state is reused before its retry time has elapsed.
I think the cause is a race condition around line 2900 in proxy_util.c:

    /*
     * Put the entire worker to error state if
     * the PROXY_WORKER_IGNORE_ERRORS flag is not set.
     * Although some connections may be alive
     * no further connections to the worker could be made
     */
    if (!connected && PROXY_WORKER_IS_USABLE(worker) &&
        !(worker->s->status & PROXY_WORKER_IGNORE_ERRORS)) {
        worker->s->error_time = apr_time_now();
        worker->s->status |= PROXY_WORKER_IN_ERROR;
        ap_log_error(APLOG_MARK, APLOG_ERR, 0, s, APLOGNO(00959)
            "ap_proxy_connect_backend disabling worker for (%s) for %"
            APR_TIME_T_FMT "s",
            worker->s->hostname, apr_time_sec(worker->s->retry));
    }
    else {
        if (worker->s->retries) {
            /*
             * A worker came back. So here is where we need to
             * either reset all params to initial conditions or
             * apply some sort of aging
             */
        }
        worker->s->error_time = 0;
        worker->s->retries = 0;
    }

I suspect that the worker was already put into error state by a parallel 
thread / process, so PROXY_WORKER_IS_USABLE(worker) is false and the else 
branch resets worker->s->error_time, which opens the worker for retry 
immediately. This has been the case since r104624
(http://svn.apache.org/viewvc?view=revision&revision=r104624), 10.5 years ago, 
and the commit message offers no hint as to why we reset these values.
Can anybody think of a good reason why we do this?
Another question is whether we shouldn't also do

worker->s->error_time = apr_time_now();

when the worker is already in error state, to restart the retry clock, since 
we just failed to connect to the backend.
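Combining both points, the branch could perhaps look like this (untested 
sketch only, reusing the identifiers from the quote above):

    if (!connected && !(worker->s->status & PROXY_WORKER_IGNORE_ERRORS)) {
        /*
         * Restart the retry clock on every failed connect, even if a
         * parallel thread / process already put the worker into error
         * state, and never reset error_time merely because the worker
         * is no longer usable.
         */
        worker->s->error_time = apr_time_now();
        if (PROXY_WORKER_IS_USABLE(worker)) {
            worker->s->status |= PROXY_WORKER_IN_ERROR;
            ap_log_error(APLOG_MARK, APLOG_ERR, 0, s, APLOGNO(00959)
                "ap_proxy_connect_backend disabling worker for (%s) for %"
                APR_TIME_T_FMT "s",
                worker->s->hostname, apr_time_sec(worker->s->retry));
        }
    }
    else if (connected) {
        /* worker came back: reset error_time / retries as today */
    }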


Regards

Rüdiger
