Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3343/2//COMMIT_MSG
Commit Message:

PS2, Line 7: Add retry to backend connection request
Maybe add a brief line explaining the motivation behind this.


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
I think it would be better to reset the timeouts to the default values when 
we're releasing the client back to the cache, so that the next user of that 
cached client doesn't inherit unexpected timeouts left over from the previous caller.

(This makes sense only if you choose to remove SetRpcTimeout() from DoRpc() and 
put it in the ClientConnection() constructor instead)
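
Something along these lines, purely as a sketch (I'm assuming here that the cached 
ThriftClientImpl can be looked up from the client key, that it exposes 
setSendTimeout()/setRecvTimeout(), and that the cache keeps its defaults around; 
substitute whatever the real names are):

  void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
    // Hypothetical lookup + setters: restore the cache-wide defaults so the next
    // user of this cached client doesn't see the previous caller's per-RPC timeouts.
    ThriftClientImpl* client = GetClientFromKey(*client_key);
    client->setSendTimeout(default_send_timeout_);
    client->setRecvTimeout(default_recv_timeout_);
    // ... existing release logic unchanged ...
  }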


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 101: SetRpcTimeout
Would this be clearer as SetClientRpcTimeout() here and below? At first glance I 
thought it set the default for the entire client cache.


PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout);
I'm a little worried about calling this every time we call DoRpc(), since this 
method takes a lock. It would be better if we could avoid that.

One possibility is to set it during the creation of the ClientConnection, where 
we can call SetRpcTimeout() only if the connection is created with explicit 
timeouts. Otherwise the default will automatically be applied to that ThriftClientImpl.

So the ClientConnection() constructor signature will become something like this:

  ClientConnection(ClientCache<T>* client_cache, TNetworkAddress address,
      Status* status, int32_t send_timeout = -1, int32_t recv_timeout = -1)
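
And in the constructor body, something like this (just a sketch; keep the existing 
client-acquisition code as-is):

  // Only take the SetRpcTimeout() lock when the caller asked for non-default
  // timeouts; otherwise the ThriftClientImpl defaults already apply.
  if (send_timeout != -1 || recv_timeout != -1) {
    SetRpcTimeout(send_timeout, recv_timeout);
  }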

And right now I don't think we need the ability to set an RPC timeout per RPC as 
opposed to per client. (Correct me if I'm wrong.)


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS4, Line 556: is_cancelled_
Can this also be IsCompleted()? Is it possible that there is a sequence of events 
where 'closed_' or 'done_' is set and this still ends up being called?


Line 559:     UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already 
cancelled."));
Should this be an error? Or is it better to just return, since the user will 
already see that the query is cancelled?
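
I.e. roughly (sketch only; adjust depending on what this function actually returns):

  // Bail out quietly instead of surfacing a second CANCELLED error; the client
  // already sees the query as cancelled.
  if (is_cancelled_) return;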


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS4, Line 128: cause problem
If we know exactly what sort of problems it could cause, it would be worth 
briefly mentioning them here.


http://gerrit.cloudera.org:8080/#/c/3343/4/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

PS4, Line 39: assert "RPC" in str(ex)
'ex' should contain "RPC timed out", right? As written, this assert would also 
pass for "RPC Error", so it would be tighter to check for "RPC timed out".


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes
