Henry Robinson has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout ......................................................................
Patch Set 15: (11 comments) How are you going to test all these failure paths for all the RPCs we have in the system? How do we convince ourselves that these failures correctly lead to query tear down in all cases? http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 97: void DestroyClient(ClientKey* client_key); needs a comment PS15, Line 197: hasRpcError_ C++-style naming PS15, Line 239: IsTimeoutTException should now be called IsRecvTimeoutTException() PS15, Line 240: RPC_TIMEOUT should be RPC_RECV_TIMEOUT PS15, Line 246: if (strstr(e.what(),"unknown result") != NULL) this seems very brittle, and at least should be wrapped in an IsXXXException() kind of method. PS15, Line 268: release back pressure is added by a blocking RPC, not released. PS15, Line 273: timeout what's the unit? PS15, Line 272: Status DoRpcTimedWait(const F& f, const Request& request, Response* response, : const G& wait_func, uint64_t timeout, RuntimeState* state, bool* can_retry = NULL) This seems like it breaks abstractions: not all RPCs happen in the context of a query. I also don't know what wait_func is for, based on the comments. I think this might be better written as two methods, which would be called like this: Status status = client->DoRpc(...., &can_retry); while (status.code() == TErrorCode::RPC_TIMEOUT && can_retry) { check_if_cancelled_or_timeout(); status = client->RetryRpcRecv(recv_func); } I understand the idea behind pushing all this logic into this method, but I think it actually hides too much from the caller. This allows the caller to take appropriate action on network timeouts without handing in a runtime state to the client (e.g. the caller may want to do some more work before blocking again). The other thing you could do is pass in a callback, so we'd have: Status status = client->DoRpc(...., [state]() { return state->is_cancelled(); }, ...); but my fear is that would get a bit messy. PS15, Line 282: bool no_timeout = timeout == 0; this can be simplified by having a deadline variable: uint64_t deadline = MonotonicMillis() + timeout; PS15, Line 309: bool hasRpcError_; the role of this variable is not clear. Please add a comment. http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS15, Line 1490: // Try to send the RPC 3 times before failing. Why try 3 times? Have you seen in your testing where there's failure on the first try, but success on attempts 2 or 3? -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <[email protected]> Gerrit-Reviewer: Alan Choi <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Juan Yu <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
