juergbi commented on PR #1901: URL: https://github.com/apache/buildstream/pull/1901#issuecomment-2007962394
With a reliable network connection to an available RE server, the client should never get the UNAVAILABLE error, even with non-trivial usage. However, we should indeed retry on UNAVAILABLE to improve operation when a server is temporarily unavailable for any reason. The retry logic is far from optimal, with and without this patch. E.g., there is no retry if the first response of `Execute()` or `WaitExecution()` fails. And there is no retry delay with backoff or a maximum number of attempts. If the initial `Execute()` response is received but there is an error later on, it tries to immediately resume the operation stream by calling `WaitExecution()`. If that fails, it will give up (as `last_operation` will be `None`). If there is a temporary network or server issue, it is likely that the immediate retry fails as well. That said, even the immediate retry on UNAVAILABLE likely helps in some cases. The only thing, that I can think of, where this patch makes anything worse might be to not report the "Failed contacting remote execution server" in case of UNAVAILABLE being fatal (i.e., either it failed already before the first response of `Execute()` in which case there will be no retry, or the immediate retry fails as well). If `last_operation is None`, we should still raise that error, at least as long as we don't retry in that case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
