Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3343/2//COMMIT_MSG
Commit Message:

PS2, Line 7: Add retry to backend connection request
Maybe add a brief line explaining the motivation behind this change.

http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
I think it would be better to reset the timeouts to their default values when we release the client back to the cache, so that the next user of the cached client doesn't inherit whatever timeouts the previous caller set. (This only makes sense if you choose to remove SetRpcTimeout() from DoRpc() and put it in the ClientConnection() constructor instead.) See the sketch below for roughly what I mean.
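Something along these lines; this is only a minimal self-contained model with made-up names (FakeClient, ClientCacheModel, the *_ms fields), not the actual types in client-cache.h:

#include <cstdint>

// Minimal model of the suggestion, not Impala's actual ClientCacheHelper:
// restore the cache-wide default timeouts when a client is released, so the
// next user doesn't inherit whatever the previous caller set.
struct FakeClient {
  int32_t send_timeout_ms = 0;  // 0 == no timeout
  int32_t recv_timeout_ms = 0;
};

class ClientCacheModel {
 public:
  ClientCacheModel(int32_t default_send_ms, int32_t default_recv_ms)
    : default_send_ms_(default_send_ms), default_recv_ms_(default_recv_ms) {}

  void ReleaseClient(FakeClient* client) {
    // Reset to the defaults before the client goes back on the free list.
    client->send_timeout_ms = default_send_ms_;
    client->recv_timeout_ms = default_recv_ms_;
    // ... existing release logic would follow here ...
  }

 private:
  const int32_t default_send_ms_;
  const int32_t default_recv_ms_;
};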
http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 101: SetRpcTimeout
Would this be clearer as SetClientRpcTimeout() here and below? At first glance I thought it set the default timeout for the entire client cache.

PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout);
I'm a little worried about calling this on every DoRpc(), since this method takes a lock. It would be better if we could avoid that. One possibility is to set the timeouts during the creation of the ClientConnection, calling SetRpcTimeout() only if it is created with explicit timeouts; otherwise the default will automatically be applied to that ThriftClientImpl. The ClientConnection() constructor signature would then become something like this:

ClientConnection(ClientCache<T>* client_cache, TNetworkAddress address, Status* status, int32_t send_timeout = -1, int32_t recv_timeout = -1)

And right now I don't think we need the ability to set an RPC timeout per RPC as opposed to per client. (Correct me if I'm wrong.) A sketch of the constructor idea follows.
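Only a sketch of the shape, with stand-in names (ConnectionModel, and a stubbed-out SetRpcTimeout()); the real constructor would of course also take the client cache, address, and status arguments shown above:

#include <cstdint>

// Minimal model of the constructor idea, not Impala's actual
// ClientConnection: -1 means "keep the client's default timeout", so the
// common path skips the locking SetRpcTimeout() call entirely.
class ConnectionModel {
 public:
  explicit ConnectionModel(int32_t send_timeout = -1,
                           int32_t recv_timeout = -1) {
    // Only pay for the locked call when explicit timeouts were requested.
    if (send_timeout != -1 || recv_timeout != -1) {
      SetRpcTimeout(send_timeout, recv_timeout);
    }
  }

 private:
  void SetRpcTimeout(int32_t send_timeout, int32_t recv_timeout) {
    // Stand-in for the real method, which takes a lock and pushes the
    // timeouts down to the underlying ThriftClientImpl.
    (void)send_timeout;
    (void)recv_timeout;
  }
};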
http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS4, Line 556: is_cancelled_
Can this also be IsCompleted()? Is there a sequence of events where 'closed_' or 'done_' is set and this still ends up being called?

Line 559: UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already cancelled."));
Should this be an error? Or is it better to just return, since the user will already see that the query was cancelled?

http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS4, Line 128: cause problem
If we know exactly what sort of problems this could cause, it would be worth briefly mentioning them here.

http://gerrit.cloudera.org:8080/#/c/3343/4/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

PS4, Line 39: assert "RPC" in str(ex)
'ex' should contain "RPC timed out", right? As written, this assertion would also pass for "RPC Error"; checking for "RPC timed out" would be stricter.

--
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes