Leif Hedstrom created TS-4801:
---------------------------------
Summary: Are we marking parents down too aggressively?
Key: TS-4801
URL: https://issues.apache.org/jira/browse/TS-4801
Project: Traffic Server
Issue Type: Bug
Components: Parent Proxy
Reporter: Leif Hedstrom
[~jrushford] and I were looking at some code, and found this piece which looks
suspicious:
{code}
} else if (s->current.attempts < s->txn_conf->parent_connect_attempts) {
s->current.attempts++;
// Are we done with this particular parent?
if ((s->current.attempts - 1) %
s->http_config_param->per_parent_connect_attempts != 0) {
// No we are not done with this parent so retry
s->next_action = how_to_open_connection(s);
DebugTxn("http_trans", "%s Retrying parent for attempt %d, max %"
PRId64, "[handle_response_from_parent]",
s->current.attempts,
s->http_config_param->per_parent_connect_attempts);
return;
} else {
DebugTxn("http_trans", "%s %d per parent attempts exhausted",
"[handle_response_from_parent]", s->current.attempts);
// Only mark the parent down if we failed to connect
// to the parent otherwise slow origin servers cause
// us to mark the parent down
if (s->current.state == CONNECTION_ERROR) {
s->parent_params->markParentDown(&s->parent_result);
}
// We are done so look for another parent if any
next_lookup = find_server_and_update_current_info(s);
}
} else {
// Done trying parents... fail over to origin server if that is
// appropriate
DebugTxn("http_trans", "[handle_response_from_parent] Error. No more
retries.");
s->parent_params->markParentDown(&s->parent_result);
s->parent_result.result = PARENT_FAIL;
next_lookup = find_server_and_update_current_info(s);
}
{code}
Here's my thinking: It seems that if we exhaust parent_connect_attempts (which
is proxy.config.http.parent_proxy.total_connect_attempts, default 4), we end up
always marking whatever the current parent is as potentially down.
This seems wrong, because it might do this even if the number of tries against
that particular parent has not been exhausted (that is the
proxy.config.http.parent_proxy.per_parent_connect_attempts setting, tested
inside the loop).
Looking at this, it feels that we should either remove that markParentDown()
entirely, or at a minimum add the same checks as above, i.e.
{code}
- s->parent_params->markParentDown(&s->parent_result);
+ if (s->current.state == CONNECTION_ERROR) {
+ s->parent_params->markParentDown(&s->parent_result);
+ }
{code}
I'm still doing some more tests, one concern is that it's unclear if the
earlier comment is correct, and that we can detect the difference between a
connection failure to parent, vs an origin server response to the parent is
just being slow (resulting in a timeout and no response to child).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)