-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54957/#review172990
-----------------------------------------------------------
I see three catgories of timeouts here:
(a) Aurora scheduler cannot be reached
If a connection between Aurora client and scheduler has not been established
yet or has been invalidated/terminated due to error (e.g., transport exception)
then client makes an attempt to stablish the connection before submitting an
operation to the scheduler. If stablishing the connection takes longer than 1
minute, a `TimeoutError` is raised in the following context in
`method_wrapper()` that is then trapped, leading to an automatic retry of the
operation.
```
@functools.wraps(method)
def method_wrapper(*args):
with self._lock:
start = time.time()
while not self._terminating.is_set() and (
time.time() - start) < self.RPC_MAXIMUM_WAIT.as_(Time.SECONDS):
try:
// self.client() raises self.TimeoutError after 1 minute if it
cannot establish a connection
method = getattr(self.client(), method_name)
if not callable(method):
return method
// raises TTransport.TTransportException for HTTP response code in
the ranges 4XX and 5XX
resp = method(*args)
if resp is not None and resp.responseCode ==
ResponseCode.ERROR_TRANSIENT:
raise self.TransientError(", ".join(
[m.message for m in resp.details] if resp.details else []))
...
except (TTransport.TTransportException, self.TimeoutError,
self.TransientError) as e:
if not self._terminating.is_set():
log.warning('Connection error with scheduler: %s,
reconnecting...' % e)
self.invalidate()
self._terminating.wait(self.RPC_RETRY_INTERVAL.as_(Time.SECONDS))
...
```
(b) Transport exception occured in the communication
A transport exception (e.g., HTTP 503 Service Temporarily Unavailable) can be
raised in the communication link between Aurora and scheduler in
`TTransportException.flush()`.
```
def flush(self):
...
try:
response = self._session.post(
self.__uri,
data=data,
timeout=self.__timeout,
auth=self.__auth)
response.raise_for_status()
...
```
This `TTransport.TTransportException` is trapped in
`SchedulerClient.method_wrapper()`, leading to an automatic retry of the
operation.
(c) Transient state error in scheduler
If Aurora scheduler discovers that its storage is not in the `READY` state to
process a storage operation (i.e., read, write, or snapshot) it throws
`Storage.TransientException`. This is caught by `LoggingInterceptor` and
translated to a thrift response with `ResponseCode.ERROR_TRANSIENT`.
```
@Override
public Object invoke(MethodInvocation invocation) throws Throwable {
...
Response response = null;
try {
// casting is safe, interception happens on methods that return Response
or its subclasses
response = (Response) invocation.proceed();
} catch (Storage.TransientStorageException e) {
LOG.warn("Uncaught transient exception while handling {}({})",
methodName, messageArgs, e);
response = Responses.addMessage(Responses.empty(),
ResponseCode.ERROR_TRANSIENT, e);
}
...
```
Aurora client receives `ResponseCode.ERROR_TRANSIENT` and raises
`TransientError` which is trapped in `SchedulerClient.method_wrapper()`,
leading to an automatic retry of the operation.
Based on the above analysis, I believe:
- (a) is always safe to retry; it just establishes a connection with the
scheduler.
- (b) is not safe to retry; the operation may or may not have been processed by
the scheduler.
- (c) is always safe to retry; previous attempt to modify storage has failed.
- Mehrdad Nurolahzade
On March 24, 2017, 11:30 a.m., Karthik Anantha Padmanabhan wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54957/
> -----------------------------------------------------------
>
> (Updated March 24, 2017, 11:30 a.m.)
>
>
> Review request for Aurora, David McLaughlin, Joshua Cohen, Santhosh Kumar
> Shanmugham, and Zameer Manji.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> This diff adds an option to not retry api calls to the scheduler. For some of
> the non-idempotent operations we would like to not automatically retry. This
> patch makes this functionality available only to the `schedule_backup_now`
> command.
>
> If there is consensus, this can be added to all commands as well.
>
>
> Diffs
> -----
>
> src/main/python/apache/aurora/admin/admin.py
> 070c348d2ca5db1edecf832efd9aa5481bddaa4b
> src/main/python/apache/aurora/client/api/__init__.py
> e1dde638bd1d686269fbcd88cb083a52e7f5dbfc
> src/main/python/apache/aurora/client/api/scheduler_client.py
> 9bbfece012e48e0b1752bbefd25c89e04d312cf6
> src/test/python/apache/aurora/client/api/test_scheduler_client.py
> f6018caa4f431e85a9e9ff203ac3d4b6c33f40ef
>
>
> Diff: https://reviews.apache.org/r/54957/diff/6/
>
>
> Testing
> -------
>
> * Manual testing
> * ./build-support/jenkins/build.sh passes
>
>
> Thanks,
>
> Karthik Anantha Padmanabhan
>
>