Leif Hedstrom created TS-4801:
---------------------------------

             Summary: Are we marking parents down too aggressively?
                 Key: TS-4801
                 URL: https://issues.apache.org/jira/browse/TS-4801
             Project: Traffic Server
          Issue Type: Bug
          Components: Parent Proxy
            Reporter: Leif Hedstrom


[~jrushford] and I were looking at some code, and found this piece which looks 
suspicious:

{code}
    } else if (s->current.attempts < s->txn_conf->parent_connect_attempts) {
      s->current.attempts++;

      // Are we done with this particular parent?
      if ((s->current.attempts - 1) % 
s->http_config_param->per_parent_connect_attempts != 0) {
        // No we are not done with this parent so retry
        s->next_action = how_to_open_connection(s);
        DebugTxn("http_trans", "%s Retrying parent for attempt %d, max %" 
PRId64, "[handle_response_from_parent]",
                 s->current.attempts, 
s->http_config_param->per_parent_connect_attempts);
        return;
      } else {
        DebugTxn("http_trans", "%s %d per parent attempts exhausted", 
"[handle_response_from_parent]", s->current.attempts);

        // Only mark the parent down if we failed to connect
        //  to the parent otherwise slow origin servers cause
        //  us to mark the parent down
        if (s->current.state == CONNECTION_ERROR) {
          s->parent_params->markParentDown(&s->parent_result);
        }
        // We are done so look for another parent if any
        next_lookup = find_server_and_update_current_info(s);
      }
    } else {
      // Done trying parents... fail over to origin server if that is
      //   appropriate
      DebugTxn("http_trans", "[handle_response_from_parent] Error. No more 
retries.");
      s->parent_params->markParentDown(&s->parent_result);
      s->parent_result.result = PARENT_FAIL;
      next_lookup             = find_server_and_update_current_info(s);
    }
{code}

Here's my thinking: It seems that if we exhaust parent_connect_attempts (which 
is proxy.config.http.parent_proxy.total_connect_attempts, default 4), we end up 
always marking whatever the current parent is as potentially down.

This seems wrong, because it might do this even if the number of tries against 
that particular parent has not been exhausted (that is the 
proxy.config.http.parent_proxy.per_parent_connect_attempts setting, tested 
inside the loop).

Looking at this, it feels that we should either remove that markParentDown() 
entirely, or at a minimum add the same checks as above, i.e.

{code}
-      s->parent_params->markParentDown(&s->parent_result);
+      if (s->current.state == CONNECTION_ERROR) {
+        s->parent_params->markParentDown(&s->parent_result);
+      }
{code}

I'm still doing some more tests, one concern is that it's unclear if the 
earlier comment is correct, and that we can detect the difference between a 
connection failure to parent, vs an origin server response to the parent is 
just being slow (resulting in a timeout and no response to child).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to