Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout ......................................................................
Patch Set 17: (17 comments) http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS17, Line 203: if (client_ != NULL) { : if (!has_rpc_error_) { : client_cache_->ReleaseClient(&client_); : } else { : client_cache_->DestroyClient(&client_); : } : } > is this path covered by your tests - i.e. the connection will be reopened i yes, the test_random_rpc_timeout covers this. It runs multiple queries and each will hit rpc timeout, then the connection will be destroyed. The following query will create new connections. Line 301: /// Indicate the rpc call sent by this connection succeeds or not. If the rpc call fails > last rpc call? Done PS17, Line 304: has_rpc_error_ > I would call this something more descriptive, like "cnxn_not_recoverable_". How about just "last_rpc_succeed_" so it matches the logic, less confusion. http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS17, Line 233: (timeout_ms == 0) ? 0 : > can remove this, and just check timeout_ms in the while condition. Done PS17, Line 236: deadline > timeout_ms Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: PS17, Line 116: can_retry > call this "retry_is_safe" ? Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-mgr.cc File be/src/service/fragment-mgr.cc: PS17, Line 37: #ifndef NDEBUG : DECLARE_int32(fault_injection_rpc_delay_ms); : DECLARE_int32(fault_injection_rpc_type); : #define FAULT_INJECTION_RPC_DALAY(type) InjectRpcDelay(type, \ : FLAGS_fault_injection_rpc_type, FLAGS_fault_injection_rpc_delay_ms) : #else : #define FAULT_INJECTION_RPC_DALAY(type) : #endif > would it be better to put this in fault-injection-util.h? Figured out the compiling issue. Yes, better to put this in fault-injection-util.h and avoid duplicate code. http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS17, Line 186: DALAY > DELAY Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/testutil/fault-injection-util.h File be/src/testutil/fault-injection-util.h: PS17, Line 15: _ > nit: remove trailing _ Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/custom_cluster/test_rpc_timeout.py File tests/custom_cluster/test_rpc_timeout.py: PS17, Line 34: i > nit: use _ to show that we don't care about the variable Done PS17, Line 61: test_runtime_filter_rpc > won't pytest try to run this as a test, since it's called test_*? it does. I'll change the name. PS17, Line 72: 3000 > is this definitely large enough to time out? I don't see if you verify that yes, since backend_client_rpc_timeout_ms is set to 1000. I verified log each RPC call did hit the timeout. however hit timeout doesn't necessary lead to query completely failure. It trigger fragment cancel, Cancel() just set runtime state cancel flag, doesn't force query execution stop. by the time execNode check cancel flag, the execution could be already complete and query finish successfully. Also some rpc failure not affect query execution, like the runtime filter one. The only one can guarantee query failure is the ExecPlanFragment(), I'll check query failure for it. PS17, Line 72: - > just want to check this works as intended with a missing '-' yes, both "-" and "--" work. but I'll make them consistent. PS17, Line 73: rpc_timeout1 > nit: can you call these "test_execplanfragment_timeout" etc? Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/query_test/test_lifecycle.py File tests/query_test/test_lifecycle.py: PS17, Line 36: \ > not needed Done PS17, Line 52: \ > remove Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/verifiers/metric_verifier.py File tests/verifiers/metric_verifier.py: PS17, Line 68: wait_for_metric_reset > Is this class really necessary, or could we use MetricVerifier and always p we usually need to check metrics for all impalad. I'll modify MetricVerifier to keep initial values. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 17 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <[email protected]> Gerrit-Reviewer: Alan Choi <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Huaisi Xu <[email protected]> Gerrit-Reviewer: Juan Yu <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
