Henry Robinson has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
......................................................................


Patch Set 15:

(11 comments)

How are you going to test all these failure paths for all the RPCs we have in 
the system? How do we convince ourselves that these failures correctly lead to 
query tear down in all cases?

http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 97:   void DestroyClient(ClientKey* client_key);
needs a comment


PS15, Line 197: hasRpcError_
C++-style naming


PS15, Line 239: IsTimeoutTException
should now be called IsRecvTimeoutTException()


PS15, Line 240: RPC_TIMEOUT
should be RPC_RECV_TIMEOUT


PS15, Line 246: if (strstr(e.what(),"unknown result") != NULL)
this seems very brittle, and at least should be wrapped in an IsXXXException() 
kind of method.


PS15, Line 268: release
back pressure is added by a blocking RPC, not released.


PS15, Line 273: timeout
what's the unit?


PS15, Line 272: Status DoRpcTimedWait(const F& f, const Request& request, 
Response* response,
              :       const G& wait_func, uint64_t timeout, RuntimeState* 
state, bool* can_retry = NULL) 
This seems like it breaks abstractions: not all RPCs happen in the context of a 
query. I also don't know what wait_func is for, based on the comments.

I think this might be better written as two methods, which would be called like 
this:

  Status status = client->DoRpc(...., &can_retry);
  while (status.code() == TErrorCode::RPC_TIMEOUT && can_retry) {
    check_if_cancelled_or_timeout();
    status = client->RetryRpcRecv(recv_func);
  }

I understand the idea behind pushing all this logic into this method, but I 
think it actually hides too much from the caller. This allows the caller to 
take appropriate action on network timeouts without handing in a runtime state 
to the client (e.g. the caller may want to do some more work before blocking 
again).

The other thing you could do is pass in a callback, so we'd have:

 Status status = client->DoRpc(...., [state]() { return state->is_cancelled(); 
}, ...);

but my fear is that would get a bit messy.


PS15, Line 282: bool no_timeout = timeout == 0;
this can be simplified by having a deadline variable:

  uint64_t deadline = MonotonicMillis() + timeout;


PS15, Line 309: bool hasRpcError_;
the role of this variable is not clear. Please add a comment.


http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS15, Line 1490: // Try to send the RPC 3 times before failing.
Why try 3 times? Have you seen in your testing where there's failure on the 
first try, but success on attempts 2 or 3?


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes

Reply via email to