I am currently hunting down an issue where a balancer member that is set to
error is reused before the retry time runs out.
I think the reason is a race condition around line 2900 in proxy_util.c:
/*
* Put the entire worker to error state if
* the PROXY_WORKER_IGNORE_ERRORS flag is not set.
 * Although some connections may still be alive,
 * no further connections to the worker can be made.
*/
if (!connected && PROXY_WORKER_IS_USABLE(worker) &&
!(worker->s->status & PROXY_WORKER_IGNORE_ERRORS)) {
worker->s->error_time = apr_time_now();
worker->s->status |= PROXY_WORKER_IN_ERROR;
ap_log_error(APLOG_MARK, APLOG_ERR, 0, s, APLOGNO(00959)
"ap_proxy_connect_backend disabling worker for (%s) for %"
APR_TIME_T_FMT "s",
worker->s->hostname, apr_time_sec(worker->s->retry));
}
else {
if (worker->s->retries) {
/*
* A worker came back. So here is where we need to
* either reset all params to initial conditions or
* apply some sort of aging
*/
}
worker->s->error_time = 0;
worker->s->retries = 0;
}
I suspect that the worker was already set to error by a parallel thread /
process; hence PROXY_WORKER_IS_USABLE(worker) is false, we fall into the else
branch, and worker->s->error_time is reset, which opens the worker for retry
immediately. This has been the case since r104624
(http://svn.apache.org/viewvc?view=revision&revision=r104624) 10.5 years ago,
and the commit message offers no hint as to why we reset these values.
Can anybody think of a good reason why we do this?
Another question is whether we shouldn't also do

worker->s->error_time = apr_time_now();

in the case where the worker is already in error state, to restart the retry
clock, since we just faced an error connecting to the backend.
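To make the idea concrete, here is a minimal, self-contained sketch of the
logic I have in mind. It is not the real mod_proxy code: the struct, flag
names, and the post_connect() helper are simplified stand-ins for the httpd
ones, just to illustrate keeping error_time when the worker is already in
error and restarting the retry clock on every failed connect:

```c
#include <assert.h>
#include <time.h>

/* Simplified stand-ins for the PROXY_WORKER_* flags and macros. */
#define WORKER_IN_ERROR      0x1u
#define WORKER_IGNORE_ERRORS 0x2u

struct worker {
    unsigned status;
    time_t   error_time;
    int      retries;
};

/* Hypothetical post-connect bookkeeping with the proposed behavior. */
static void post_connect(struct worker *w, int connected, time_t now)
{
    if (!connected && !(w->status & WORKER_IGNORE_ERRORS)) {
        /* Restart the retry clock even if a parallel thread already
         * put the worker into error state. */
        w->status |= WORKER_IN_ERROR;
        w->error_time = now;
    }
    else if (connected) {
        /* Only a successful connect resets the error bookkeeping;
         * a failed connect on an already-errored worker no longer
         * zeroes error_time. */
        w->error_time = 0;
        w->retries = 0;
    }
}
```

With this shape, a worker that another thread has just errored out keeps (or
refreshes) its error_time on a further failed connect instead of becoming
retryable immediately.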
Regards
Rüdiger