Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17343 )

Change subject: [client] retry master RPCs on network errors
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc
File src/kudu/client/txn_manager_proxy_rpc.cc:

http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@106
PS1, Line 106:   func_(client_->data_->txn_manager_proxy().get(), req_, resp_, 
controller,
             :         [this]() { this->SendRpcCb(Status::OK()); });
             : }
             :
> This check is already made above on L99.
Done


http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@110
PS1, Line 110: template <class ReqClass, class RespClass>
             : string AsyncRandomTxnManagerRpc<ReqClass, RespClass>::ToString() 
const {
             :   return rpc_name_;
             : }
             :
> How much is it differnt from returning std::shared_ptr by txn_manager_proxy
As it turns out, the issue I was seeing was because of the client disappearing 
out from under us. This check helped a bit because it added a distinct place 
for us to exit out. I reverted this to be a shared_ptr again and made sure we 
keep the client around in the user_cb_ by putting it in the KeepaliveRpcCtx.


http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/client/txn_manager_proxy_rpc.cc@115
PS1, Line 115: mplate <class Req
> The same pattern is in AsyncLeaderMasterRpc<ReqClass, RespClass>::SendRpc()
Mentioned elsewhere, this seems specific to keepalive and I addressed it in a 
more tailored approach (making sure the KuduClient outlives these calls).


http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc
File src/kudu/integration-tests/txn_status_table-itest.cc:

http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@477
PS1, Line 477: CHECK_OK(
> Just curious: does this work as expected when Restart() returns non-OK stat
Ah indeed. I replaced this with an ASSERT_TRUE(false) and saw the test proceed 
presumably until timeout. Replaced with a CHECK_OK().


http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@486
PS1, Line 486: MonoDelta::FromSeconds(10)))
> Just curios: does this turn to be enough to be stable in case of TSAN build
I haven't seen it fail, but I bumped it to 10s anyway.


http://gerrit.cloudera.org:8080/#/c/17343/1/src/kudu/integration-tests/txn_status_table-itest.cc@487
PS1, Line 487:
> nit: is this some sort of invariant we want to enforce in this scenario or
Done



--
To view, visit http://gerrit.cloudera.org:8080/17343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iae2febb9e890acf6f7efd5cce3cb7e4f7b5f683d
Gerrit-Change-Number: 17343
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 27 Apr 2021 21:52:40 +0000
Gerrit-HasComments: Yes

Reply via email to