Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12672 )
Change subject: IMPALA-8143: Enhance DoRpcWithRetry(). ...................................................................... Patch Set 2: (19 comments) http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h File be/src/rpc/rpc-mgr-test.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h@296 PS5, Line 296: ERR typo http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc File be/src/rpc/rpc-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@214 PS5, Line 214: ASSERT_OK(rpc_mgr_.RegisterService(10, 10, scan_mem_impl, : static_cast< Not sure if it's output of clang-tidy or something but I find it a bit hard to ready to split it like this. Can't we keep these 2 lines unchanged ? http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@263 PS5, Line 263: ueue f typo http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@291 PS5, Line 291: nullptr; Also do you want to assert that it's zero ? http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@298 PS5, Line 298: // Configure the service not used ? http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@316 PS5, Line 316: ASSERT_OK(ping_ Instead of sharing "sleep_ms" between RPCs, it may be cleaner to pass the sleep time as RPC argument instead. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@322 PS5, Line 322: // A lambda t proxy-> http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@333 PS5, Line 333: : // wait for async call to be running : sleep_started.Wait(); : ASSERT_EQ(0, get_current_queue_size(rpc_mgr_)); : : // Do a second async call that will fill up the queue : RpcController async_controller2; : CountingBarrier async2_done(1); : // Reset the sleep time : sleep_ms = 100; : // Set the deadline to something reasonable otherwise the pings from DoRpcWithRetry : // will replace this async call in the queue. : MonoTime deadline = MonoTime::Now() + MonoDelta::FromSeconds(10); : async_controller2.set_deadline(deadline); : ResponseCallback async2_callback = [&]() { async2_done.Notify(); }; : proxy.get()->PingAsync(request, &response, &async_controller2, async2_callback); : : // Wait for second async to get queued : SleepForMs(300); : // Check the queue size : ASSERT_EQ(1, get_current_queue_size( These set of tests seem rather timing dependent. For instance, it may occur that the first time DoRpcWithRetry() is called will succeed because the async call somehow finishes by then due to some weird scheduling order. May be better to rethink a better way to test it. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@378 PS5, Line 378: ASSERT_FALSE(async2_done.pending()); : : // Test injection of failures with DebugAction. : query_ctx.client_request.query_options.__set_debug_action("DoRpcWithRetry:FAIL"); : Status inject_status = RpcMgr::DoRpcWithRetry(ping_rpc, query_ctx, "ping failed", 1, : MonoDelta::FromSeconds(100), 0, "DoRpcWithRetry"); : ASSERT_FALSE(inject_status.ok()); : EXPECT_ERROR(inject_status, TErrorCo Please see comments in rpc-mgr.inline.h that we probably should only retry if the remote server is busy. For all other errors, we probably should report it right away to callers. We can consider transparently handling other types of errors in the future. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h File be/src/rpc/rpc-mgr.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@178 PS5, Line 178: void ToJson(rapidjson::Document* document); : : /// Retry the Rpc 'rpc_call' up to 'times_to_try' times. : /// Each Rpc has a timeout of 'timeout'. : /// If the service is busy then sleep 'se Not needed as discussed in person. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@189 PS5, Line 189: the R typename http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@191 PS5, Line 191: n the 'l It's important to highlight that this only works iff rpc_call is idempotent in the comments above. Any non-idempotent rpc_call using this interface may have undesired consequence. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@193 PS5, Line 193: ebugAction() to potentially inject errors. : template <typename F, typename L> It seems inconsistent to pass the timeout in as MonoDelta while passing the backoff time in second. To be consistent with the rest of the code in Impala, it may be better to pass the timeout as int64_t instead (e.g int64_t timeout_ms, int64_t backoff_ms) In addition, the backoff's unit is second, which may be too coarse grained. I would suggest using millisecond instead. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@201 PS5, Line 201: ) before destroying RpcMgr"; : } : : private: : /// One pool per registered service. sc Although this function is inlined, It has grown over time to take too many arguments. It may benefit from some refactoring if possible. One simplification is to considering hiding the query_ctx and debug action together into a simple lambda function which gets invoked inside DoRpcWithRetry(). http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h File be/src/rpc/rpc-mgr.inline.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@57 PS5, Line 57: times_to_try, const Mono May make sense to group this argument next to query_ctx as they are related. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@61 PS5, Line 61: , d ++i http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@72 PS5, Line 72: Do we need to reset response between retry ? http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@76 PS5, Line 76: // Check for injected failures. : rpc_ Based on our offline discussion, this is not needed. Please consider removing it. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@79 PS5, Line 79: } : : const kudu::Status& k_status = rpc_call(&rpc_controller); : rpc_status = FromKuduStatus(k_status, error_msg); : i So, it's not clear to me whether we should retry for errors other than RpcMgr::IsServerTooBusy(). In other words, if we hit other types of fatal errors, should we report them back to the caller right away ? Another possible way to treat it is that the code will retry regardless of the error types, assuming that any kind of error was due to some transient errors which may go away. In which case, it may make sense to backoff regardless of the error types, right ? What's your thought ? -- To view, visit http://gerrit.cloudera.org:8080/12672 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia9693151c35e02235665b3c285a48c585973d390 Gerrit-Change-Number: 12672 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Thu, 28 Mar 2019 00:15:45 +0000 Gerrit-HasComments: Yes