Alexey Serbin has posted comments on this change.

Change subject: [client] retry operation in case of ServiceUnavailable
......................................................................
Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5964/6/src/kudu/client/client-internal.cc
File src/kudu/client/client-internal.cc:

PS6, Line 241: if (s.ok() && resp->has_error()) {
             :   if (resp->error().code() == MasterErrorPB::NOT_THE_LEADER ||
             :       resp->error().code() == MasterErrorPB::CATALOG_MANAGER_NOT_INITIALIZED) {
             :     if (client->IsMultiMaster()) {
             :       KLOG_EVERY_N_SECS(INFO, 1) << "Determining the new leader Master and retrying...";
             :       WARN_NOT_OK(ConnectToCluster(client, deadline),
             :                   "Unable to determine the new leader Master");
             :     }
             :     continue;
             :   } else {
             :     return StatusFromPB(resp->error().status());
             :   }
             : }
> FWIW, I don't believe this case can be reached right now. I don't know if i
During testing I found that at least the (s.ok() && resp->has_error()) branch was reachable.

I touched this piece of the code because:
1) I saw there would be an issue of not retrying the call if the
   (code == NOT_THE_LEADER || code == CATALOG_MANAGER_NOT_INITIALIZED)
   sub-criterion is hit in the case of a single-master cluster.
2) I wanted to consolidate the StatusFromPB() handling for the scenarios
   where (s.ok() && resp->has_error()) is hit.

Do you mind if we address the reachability of the sub-criterion in a separate changelist?

http://gerrit.cloudera.org:8080/#/c/5964/6/src/kudu/integration-tests/create-table-stress-test.cc
File src/kudu/integration-tests/create-table-stress-test.cc:

Line 337: SleepFor(MonoDelta::FromMilliseconds(1 + rand() % 10));
> Why this change?
1) Since the CreateTable() calls now back off exponentially, the metadata
   reload activity makes the test very, very long.
2) I thought it might be better to employ some randomness here because this
   test is supposed to expose race conditions, if any (at least the main
   test comment says so).
Line 351: if (s.IsTimedOut()) {
> Could we set the default admin operation timeout of the client real high an
We could, but I'm not sure it's possible to calculate that timeout so that
it works in every testing environment. The problem is the
exponential-with-randomness back-off retry behavior. I'm open to suggestions
here -- if you think we can amend this scenario to keep its essence but
remove the unpredictability, I can try to implement it.

--
To view, visit http://gerrit.cloudera.org:8080/5964
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iae169ce1efe3950dfcb769594a1944d05e2ad8e7
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
