Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17343 )
Change subject: [client] retry master RPCs on network errors ...................................................................... Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc File src/kudu/client/txn_manager_proxy_rpc.cc: http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@106 PS1, Line 106: func_(client_->data_->txn_manager_proxy().get(), req_, resp_, controller, : [this]() { this->SendRpcCb(Status::OK()); }); : } : > This check is already made above on L99. Done http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@110 PS1, Line 110: template <class ReqClass, class RespClass> : string AsyncRandomTxnManagerRpc<ReqClass, RespClass>::ToString() const { : return rpc_name_; : } : > How much is it differnt from returning std::shared_ptr by txn_manager_proxy As it turns out, the issue I was seeing was because of the client disappearing out from under us. This check helped a bit because it added a distinct place for us to exit out. I reverted this to be a shared_ptr again and made sure we keep the client around in the user_cb_ by putting it in the KeepaliveRpcCtx. http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@115 PS1, Line 115: mplate <class Req > The same pattern is in AsyncLeaderMasterRpc<ReqClass, RespClass>::SendRpc() Mentioned elsewhere, this seems specific to keepalive and I addressed it in a more tailored approach (making sure the KuduClient outlives these calls). http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc File src/kudu/integration-tests/txn_status_table-itest.cc: http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@477 PS1, Line 477: CHECK_OK( > Just curious: does this work as expected when Restart() returns non-OK stat Ah indeed. I replaced this with an ASSERT_TRUE(false) and saw the test proceed presumably until timeout. Replaced with a CHECK_OK(). http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@486 PS1, Line 486: MonoDelta::FromSeconds(10))) > Just curios: does this turn to be enough to be stable in case of TSAN build I haven't seen it fail, but I bumped it to 10s anyway. http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@487 PS1, Line 487: > nit: is this some sort of invariant we want to enforce in this scenario or Done -- To view, visit http://gerrit.cloudera.org:8080/17343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iae2febb9e890acf6f7efd5cce3cb7e4f7b5f683d Gerrit-Change-Number: 17343 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Bankim Bhavsar <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Tue, 27 Apr 2021 21:52:40 +0000 Gerrit-HasComments: Yes
