Henry Robinson has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
......................................................................


Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3343/5//COMMIT_MSG
Commit Message:

PS5, Line 17: unless you restart coordinato
This is true if the remote peer does not send an RST (which would be the case 
with a kernel panic or network failure). We could consider using per-connection 
TCP keepalives to address this in the future.


http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

PS5, Line 48: // Consider back pressure, we should wait for receiver side as long as possible
            :     // but not forever. Also we need to make sure Cancel() can stop the waiting.
I preferred the old way where you could set per-RPC timeouts. You can combine 
that with per-client timeouts, if you think that's neater, but this custom 
logic is best avoided if possible.

The preferred way to do this is to have TransmitData() return "not ready" as a 
status, and then have the client retry with an application-level timeout.


PS5, Line 51: MonotonicNanos
nit: use MonotonicMillis()


http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS5, Line 1490: // Try to send the RPC 3 times before failing.
              :     for (int i = 0; i < 3; ++i) {
This always sends the RPC three times, even when the first attempt succeeds: 
the loop never breaks on success.


Line 1492:       rpc_status= backend_client.DoRpc(
space before '='


PS5, Line 1494: if (rpc_status.code() == TErrorCode::RPC_TIMEOUT) break;
If you abort after a timeout error, what's the rationale for sending the RPC 
three times? What error condition does the retry address?


http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS5, Line 127: // Otherwise Coordinator::UpdateFragmentExecStatus() could be called multiple times
             :     // for the same finished PlanFragmentExecutor, num_remaining_fragment_instances_ will
             :     // be decreased more than once and cause DCHECK.
We should fix the bug in UpdateFragmentExecStatus(), rather than trying to 
avoid it here. If the first attempt didn't get through, but 
executor_.IsCompleted() is true, won't that mean the status will never be sent?


PS5, Line 138: // TODO: Do we really need to cancel?
I think we definitely should cancel: a failure here suggests the coordinator 
has failed.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes
