Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
......................................................................


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
> I think it would be better to reset the timeouts to the default values when
Good idea.
I am still running tests to see if we do need different timeout for different 
RPC calls.


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 101: SetRpcTimeout
> Would this be clearer as SetClientRpcTimeout() here and below? On first loo
agreed


PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout);
> I'm a little worried about calling this every time we call a DoRpc() since 
You're right, set the timeout per client makes sense.


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS4, Line 556: is_cancelled_
> Can this also be IsCompleted()? Is it a possibility that there is a sequenc
StopReportThread is called if fragment is completed, or cancel happened during 
GetNext(). 
If an internal error happened at other time, it usually trigger internal 
cancelling, Cancel() will be called, 
But I agree, IsCompleted() seems a safer check here.


Line 559:     UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already 
cancelled."));
> Should this be an error? Or is it better to just return since the user will
I don't think this should be treat as error. the only issue is the report 
thread keep running and send status to coordinator which is confusing. 
I'll just call StopReportThread().


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS4, Line 128: cause problem
> If we know exactly what sort of problems it could cause, it would be worth 
UpdateFragmentExecStatus() will decrease num_remaining_fragment_instances_ 
counter, call it multiple times for a completed fragment will cause DCHECK.
I'll add those to comments.


http://gerrit.cloudera.org:8080/#/c/3343/4/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

PS4, Line 39: assert "RPC" in str(ex)
> 'ex' should contain "RPC timed out" right? This would pass for "RPC Error" 
We used to capture only one specific timeout case, and try reopen client for 
all other errors. I am not sure that's correct way so I modified the exception 
handling. But now I see more RPC error case than before. Maybe I should roll 
back the exception handling change.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes

Reply via email to