[ https://issues.apache.org/jira/browse/TS-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652696#comment-15652696 ]
Phil Sorber commented on TS-4801: --------------------------------- [~jrushford] said that this was backported with TS-3419 back port so I am marking this as back ported but taking no further action. > Are we marking parents down too aggressively? > --------------------------------------------- > > Key: TS-4801 > URL: https://issues.apache.org/jira/browse/TS-4801 > Project: Traffic Server > Issue Type: Bug > Components: Parent Proxy > Reporter: Leif Hedstrom > Assignee: Leif Hedstrom > Fix For: 6.2.1, 7.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > [~jrushford] and I were looking at some code, and found this piece which > looks suspicious: > {code} > } else if (s->current.attempts < s->txn_conf->parent_connect_attempts) { > s->current.attempts++; > // Are we done with this particular parent? > if ((s->current.attempts - 1) % > s->http_config_param->per_parent_connect_attempts != 0) { > // No we are not done with this parent so retry > s->next_action = how_to_open_connection(s); > DebugTxn("http_trans", "%s Retrying parent for attempt %d, max %" > PRId64, "[handle_response_from_parent]", > s->current.attempts, > s->http_config_param->per_parent_connect_attempts); > return; > } else { > DebugTxn("http_trans", "%s %d per parent attempts exhausted", > "[handle_response_from_parent]", s->current.attempts); > // Only mark the parent down if we failed to connect > // to the parent otherwise slow origin servers cause > // us to mark the parent down > if (s->current.state == CONNECTION_ERROR) { > s->parent_params->markParentDown(&s->parent_result); > } > // We are done so look for another parent if any > next_lookup = find_server_and_update_current_info(s); > } > } else { > // Done trying parents... fail over to origin server if that is > // appropriate > DebugTxn("http_trans", "[handle_response_from_parent] Error. No more > retries."); > s->parent_params->markParentDown(&s->parent_result); > s->parent_result.result = PARENT_FAIL; > next_lookup = find_server_and_update_current_info(s); > } > {code} > Here's my thinking: It seems that if we exhaust parent_connect_attempts > (which is proxy.config.http.parent_proxy.total_connect_attempts, default 4), > we end up always marking whatever the current parent is as potentially down. > This seems wrong, because it might do this even if the number of tries > against that particular parent has not been exhausted (that is the > proxy.config.http.parent_proxy.per_parent_connect_attempts setting, tested > inside the loop). > Looking at this, it feels that we should either remove that markParentDown() > entirely, or at a minimum add the same checks as above, i.e. > {code} > - s->parent_params->markParentDown(&s->parent_result); > + if (s->current.state == CONNECTION_ERROR) { > + s->parent_params->markParentDown(&s->parent_result); > + } > {code} > I'm still doing some more tests, one concern is that it's unclear if the > earlier comment is correct, and that we can detect the difference between a > connection failure to parent, vs an origin server response to the parent is > just being slow (resulting in a timeout and no response to child). -- This message was sent by Atlassian JIRA (v6.3.4#6332)