Henry Robinson has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout ......................................................................
Patch Set 5: (8 comments) http://gerrit.cloudera.org:8080/#/c/3343/5//COMMIT_MSG Commit Message: PS5, Line 17: unless you restart coordinato This is true if the remote peer does not send an RST (which would be the case with a kernel panic or network failure). We could consider using per-connection TCP keepalives to address this in the future. http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: PS5, Line 48: // Consider back pressure, we should wait for receiver side as long as possible : // but not forever. Also we need to make sure Cancel() can stop the waiting. I preferred the old way where you could set per-RPC timeouts. You can combine that with per-client timeouts, if you think that's neater, but this custom logic is best avoided if possible. The preferred way to do this is to have TransmitData() return "not ready" as a status, and then have the client retry with an application-level timeout. PS5, Line 51: MonotonicNanos nit: use MontonicMillis() http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS5, Line 1490: // Try to send the RPC 3 times before failing. : for (int i = 0; i < 3; ++i) { This always sends the RPC three times if it succeeds every time. Line 1492: rpc_status= backend_client.DoRpc( space before '=' PS5, Line 1494: if (rpc_status.code() == TErrorCode::RPC_TIMEOUT) break; If you abort after a timeout error, what's the rationale for sending the RPC three times? What error condition does the retry address? http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: PS5, Line 127: // Otherwise Coordinator::UpdateFragmentExecStatus() could be called multiple times : // for the same finished PlanFragmentExecutor, num_remaining_fragment_instances_ will : // be decreased more than once and cause DCHECK. We should fix the bug in UpdateFragmentExecStatus(), rather than trying to avoid it here. If the first attempt didn't get through, but the executor_.IsCompleted() is true, won't that mean the status will never be sent? PS5, Line 138: // TODO: Do we really need to cancel? I think we definitely should cancel: this makes us think the coordinator has failed. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <[email protected]> Gerrit-Reviewer: Alan Choi <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Juan Yu <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
