Henry Robinson has posted comments on this change.

Change subject: IMPALA-5388: Don't retry RPC calls on TSSLException
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7063/1/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS1, Line 258:  Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
> Should we consider returning RPC_RECV_TIMEOUT instead if e.what() contains 
I don't think so - RPC_RECV_TIMEOUT is only used elsewhere to properly print an 
error message.


PS1, Line 259: catch (const apache::thrift::TException& e) {
             :       if (IsRecvTimeoutTException(e)) {
             :         return Status(TErrorCode::RPC_RECV_TIMEOUT, strings::Substitute(
             :             "Client $0 timed-out during recv call.", TNetworkAddressToString(address_)));
             :       }
             :       VLOG(1) << "client " << client_ << " unexpected exception: "
             :               << e.what() << ", type=" << typeid(e).name();
             : 
             :       // Client may have unexpectedly been closed, so re-open and retry.
             :       // TODO: ThriftClient should return proper error codes.
             :       const Status& status = Reopen();
             :       if (!status.ok()) {
             :         if (retry_is_safe != NULL) *retry_is_safe = true;
             :         return Status(TErrorCode::RPC_CLIENT_CONNECT_FAILURE, status.GetDetail());
             :       }
             :       try {
             :         (client_->*f)(*response, request);
             :       } catch (apache::thrift::TException& e) {
             :         // By this point the RPC really has failed.
             :         // TODO: Revisit this logic later. It's possible that the new connection
             :         // works but we hit timeout here.
             :         return Status(TErrorCode::RPC_GENERAL_ERROR, e.what());
The more I stare at this, the more I think it's broken even without SSL.
TSocket can clearly throw TExceptions on its read() path, and any of those
would trigger a spurious retry here. TSocket reports a narrow set of error
codes for the 'socket not open / conn reset' cases, and those would be better
used to decide whether to retry.

In fact I just tried this: by throwing a TException between the write and the
read of a single TransmitData() RPC, I can get wrong results pretty easily.

I think we need to restructure this block so that an RPC is retried only if
a) it failed on the write path and b) the error code is NOT_OPEN.
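
To make the proposal concrete, here is a rough sketch of the control flow I
have in mind. It is not the actual client-cache.h code. TransportError,
TransportErrorType, RpcResult and DoRpcWithNarrowRetry are hypothetical
stand-ins (for a Thrift transport exception with a type code, and for our
Status codes), and it assumes the RPC can be split into separate send and recv
steps.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a transport exception that carries a type code.
enum class TransportErrorType { UNKNOWN, NOT_OPEN, TIMED_OUT, END_OF_FILE };

class TransportError : public std::runtime_error {
 public:
  TransportError(TransportErrorType type, const std::string& msg)
    : std::runtime_error(msg), type_(type) {}
  TransportErrorType type() const { return type_; }

 private:
  TransportErrorType type_;
};

// Hypothetical stand-in for the Status codes returned by the real code.
enum class RpcResult { OK, RECV_TIMEOUT, CONNECT_FAILURE, GENERAL_ERROR };

// Runs one RPC as a send step followed by a recv step. The only failure that
// is retried is a NOT_OPEN error raised while sending; the request cannot have
// reached the server in that case, so resending cannot duplicate it.
RpcResult DoRpcWithNarrowRetry(const std::function<void()>& send,
    const std::function<void()>& recv, const std::function<bool()>& reopen) {
  try {
    send();
  } catch (const TransportError& e) {
    // Any other error on the write path is reported without a retry.
    if (e.type() != TransportErrorType::NOT_OPEN) return RpcResult::GENERAL_ERROR;
    if (!reopen()) return RpcResult::CONNECT_FAILURE;
    try {
      send();  // Single retry on the fresh connection.
    } catch (const std::exception&) {
      return RpcResult::GENERAL_ERROR;
    }
  } catch (const std::exception&) {
    return RpcResult::GENERAL_ERROR;
  }

  // Never retry once the request has been written: the server may already
  // have processed it.
  try {
    recv();
  } catch (const TransportError& e) {
    return e.type() == TransportErrorType::TIMED_OUT ? RpcResult::RECV_TIMEOUT
                                                     : RpcResult::GENERAL_ERROR;
  } catch (const std::exception&) {
    return RpcResult::GENERAL_ERROR;
  }
  return RpcResult::OK;
}
```

With this shape, an exception thrown between the write and the read (the case
I reproduced above) surfaces as an error to the caller and is not resent.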


-- 
To view, visit http://gerrit.cloudera.org:8080/7063
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes
