Dan Hecht has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout ......................................................................
Patch Set 20: (24 comments) Need to look through the last few files still but here's the first set of comments. http://gerrit.cloudera.org:8080/#/c/3343/20//COMMIT_MSG Commit Message: Line 12: Impala doesn't set socket send/recv timeout for backend client. Prior to this change, ... PS20, Line 24: new RPC call RetryRpcRecv() This makes it sound like RetryRpcRecv() is a new RPC. I haven't looked at the change yet, but I wouldn't expect it to be. Clarify. PS20, Line 28: call delete (C of RPC stands for call) http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/common/global-flags.cc File be/src/common/global-flags.cc: PS20, Line 102: call delete PS20, Line 102: which trigger caller hits RPC timeout. to trigger an RPC timeout on the caller side. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.cc File be/src/runtime/client-cache.cc: Line 162: client_map_.erase(client); is it legal to use the 'client' iterator even though the lock was dropped in the mean time? i.e. don't we need to re-find in case the map has changed? Line 163: *client_key = NULL; why doesn't this method need to remove from the per_host_caches_ as well? The header comment seems to imply that happens. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS20, Line 96: cache is this "the per-host cache" that is referenced everywhere else, or something different? clarify. PS20, Line 96: it what is "it"? the client? or the connection? does this NULL out *client_key like ReleaseClient()? PS20, Line 224: indicate indicates Is retry_is_safe an in parameter or out parameter? Clarify this. also, is it the case that *return_is_safe is only relevant when returned status is not okay? Line 225: /// the whole RPC call. explain this more. is it because if the send RPC call failed, then the handler was never executed to process the RPC. On the other hand, if the send call completed, the RPC can't be restarted because the handler is already in progress. Is that right? Line 260: if (!status.ok()) { so this is the only path where we have *retry_is_safe == true and an error is returned, right? If so, why not set *retry_is_safe to false by default (line 240), and then only set it to true here. That would make the intent of the code clearer because then it's calling out the exceptional failure path where we can retry. Line 265: } catch (apache::thrift::TException& e) { why does this path not need the timeout check also? Line 271: client_is_unrecoverable_ = false; why is this logic needed now, whereas it wasn't needed before? or is this another bug fix that would have been beneficial even without the retry logic? PS20, Line 301: the whether the but why not just call this 'last_rpc_succeeded_' to make it more self explanatory? http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: Line 115: ImpalaBackendClientCache* client_cache_; now that we store the RuntimeState, let's get rid of this and just get the cache from the runtime state when needed (which isn't very perf critical), to make things easier to follow. PS20, Line 153: wait retry waiting for the response? PS20, Line 153: call delete "call", and would be nice to write RPC PS20, Line 231: timeout_ms where is this called with a non-zero timeout_ms? Line 239: timeout_ms)); this will throw away the original status message. Maybe use AddDetail() instead on the original status. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: Line 125: if (!retry_is_safe) break; so i guess this was a bug in the old code? http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-mgr.cc File be/src/service/fragment-mgr.cc: Line 38: FAULT_INJECTION_RPC_DELAY(RPC_EXECPLANFRAGMENT); is there a reason we decided to put these here rather than in impala-internal-services.h handlers, which would then show we've marked up all the RPCs? http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: PS20, Line 658: Rewrite Add details to PS20, Line 669: Rewrite same -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <[email protected]> Gerrit-Reviewer: Alan Choi <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Huaisi Xu <[email protected]> Gerrit-Reviewer: Juan Yu <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
