-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54957/#review172990
-----------------------------------------------------------



I see three catgories of timeouts here:

(a) Aurora scheduler cannot be reached

If a connection between Aurora client and scheduler has not been established 
yet or has been invalidated/terminated due to error (e.g., transport exception) 
then client makes an attempt to stablish the connection before submitting an 
operation to the scheduler. If stablishing the connection takes longer than 1 
minute, a `TimeoutError` is raised in the following context in 
`method_wrapper()` that is then trapped, leading to an automatic retry of the 
operation.

```
    @functools.wraps(method)
    def method_wrapper(*args):
      with self._lock:
        start = time.time()
        while not self._terminating.is_set() and (
            time.time() - start) < self.RPC_MAXIMUM_WAIT.as_(Time.SECONDS):

          try:
            // self.client() raises self.TimeoutError after 1 minute if it 
cannot establish a connection
            method = getattr(self.client(), method_name)
            if not callable(method):
              return method
            // raises TTransport.TTransportException for HTTP response code in 
the ranges 4XX and 5XX
            resp = method(*args)
            if resp is not None and resp.responseCode == 
ResponseCode.ERROR_TRANSIENT:
              raise self.TransientError(", ".join(
                  [m.message for m in resp.details] if resp.details else []))   
         
          ...
          except (TTransport.TTransportException, self.TimeoutError, 
self.TransientError) as e:
            if not self._terminating.is_set():
              log.warning('Connection error with scheduler: %s, 
reconnecting...' % e)
              self.invalidate()
              self._terminating.wait(self.RPC_RETRY_INTERVAL.as_(Time.SECONDS))
        ...
```

(b) Transport exception occured in the communication

A transport exception (e.g., HTTP 503 Service Temporarily Unavailable) can be 
raised in the communication link between Aurora and scheduler in 
`TTransportException.flush()`. 

```
  def flush(self):
    ...
    try:
      response = self._session.post(
          self.__uri,
          data=data,
          timeout=self.__timeout,
          auth=self.__auth)
      response.raise_for_status()
    ...
```

This `TTransport.TTransportException` is trapped in 
`SchedulerClient.method_wrapper()`, leading to an automatic retry of the 
operation.

(c) Transient state error in scheduler

If Aurora scheduler discovers that its storage is not in the `READY` state to 
process a storage operation (i.e., read, write, or snapshot) it throws 
`Storage.TransientException`. This is caught by `LoggingInterceptor` and 
translated to a thrift response with `ResponseCode.ERROR_TRANSIENT`.

```
  @Override
  public Object invoke(MethodInvocation invocation) throws Throwable {
    ...
    Response response = null;
    try {
      // casting is safe, interception happens on methods that return Response 
or its subclasses
      response = (Response) invocation.proceed();
    } catch (Storage.TransientStorageException e) {
      LOG.warn("Uncaught transient exception while handling {}({})", 
methodName, messageArgs, e);
      response = Responses.addMessage(Responses.empty(), 
ResponseCode.ERROR_TRANSIENT, e);
    } 
    ...
```

Aurora client receives `ResponseCode.ERROR_TRANSIENT` and raises 
`TransientError` which is trapped in `SchedulerClient.method_wrapper()`, 
leading to an automatic retry of the operation.

Based on the above analysis, I believe:
- (a) is always safe to retry; it just establishes a connection with the 
scheduler.
- (b) is not safe to retry; the operation may or may not have been processed by 
the scheduler.
- (c) is always safe to retry; previous attempt to modify storage has failed.

- Mehrdad Nurolahzade


On March 24, 2017, 11:30 a.m., Karthik Anantha Padmanabhan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54957/
> -----------------------------------------------------------
> 
> (Updated March 24, 2017, 11:30 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, Santhosh Kumar 
> Shanmugham, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This diff adds an option to not retry api calls to the scheduler. For some of 
> the non-idempotent operations we would like to not automatically retry. This 
> patch makes this functionality available only to the `schedule_backup_now` 
> command.
> 
> If there is consensus, this can be added to all commands as well.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/admin/admin.py 
> 070c348d2ca5db1edecf832efd9aa5481bddaa4b 
>   src/main/python/apache/aurora/client/api/__init__.py 
> e1dde638bd1d686269fbcd88cb083a52e7f5dbfc 
>   src/main/python/apache/aurora/client/api/scheduler_client.py 
> 9bbfece012e48e0b1752bbefd25c89e04d312cf6 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> f6018caa4f431e85a9e9ff203ac3d4b6c33f40ef 
> 
> 
> Diff: https://reviews.apache.org/r/54957/diff/6/
> 
> 
> Testing
> -------
> 
> * Manual testing
> * ./build-support/jenkins/build.sh passes
> 
> 
> Thanks,
> 
> Karthik Anantha Padmanabhan
> 
>

Reply via email to