Alexey Serbin has posted comments on this change.

Change subject: [client] retry operation in case of ServiceUnavailable
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5964/6/src/kudu/client/client-internal.cc
File src/kudu/client/client-internal.cc:

PS6, Line 241:     if (s.ok() && resp->has_error()) {
             :       if (resp->error().code() == MasterErrorPB::NOT_THE_LEADER 
||
             :           resp->error().code() == 
MasterErrorPB::CATALOG_MANAGER_NOT_INITIALIZED) {
             :         if (client->IsMultiMaster()) {
             :           KLOG_EVERY_N_SECS(INFO, 1) << "Determining the new 
leader Master and retrying...";
             :           WARN_NOT_OK(ConnectToCluster(client, deadline),
             :                       "Unable to determine the new leader 
Master");
             :         }
             :         continue;
             :       } else {
             :         return StatusFromPB(resp->error().status());
             :       }
             :     }
> FWIW, I don't believe this case can be reached right now. I don't know if i
During the testing I found that at least the (s.ok() and && resp->has_error()) 
was reachable.

I touched this piece of the code because:
1) I saw there would be an issue of non-retrying the call if that (code == 
NOT_THE_LEADER || code == CATALOG_MANAGER_NOT_INITIALIZED) sub-criterion is hit 
in case of single-master cluster
2) I wanted to consolidate that StatusFromPB() stuff for the scenarios when 
(s.ok() && resp->has_error()) is hit.

Do you mind if we address the reachability of the sub-criterion in a separate 
changelist?


http://gerrit.cloudera.org:8080/#/c/5964/6/src/kudu/integration-tests/create-table-stress-test.cc
File src/kudu/integration-tests/create-table-stress-test.cc:

Line 337:       SleepFor(MonoDelta::FromMilliseconds(1 + rand() % 10));
> Why this change?
1) Since CreateTable() calls are started to back-off exponentially, that 
metadata reload activity makes the test long very-very long.
2) I though it might be better to employ some randomness here because this test 
is supposed to expose race conditions, if any (at least the main test comment 
says so).


Line 351:     if (s.IsTimedOut()) {
> Could we set the default admin operation timeout of the client real high an
We could, but I'm not sure it's possible to calculate that timeout to make sure 
it works in every testing environment.  The problem is with 
exponential-with-randomness back-off retry behavior.

I'm open for suggestions here -- if you think we can amend this scenario to 
keep the essence but remove the unpredictability, I can try to implement it.


-- 
To view, visit http://gerrit.cloudera.org:8080/5964
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iae169ce1efe3950dfcb769594a1944d05e2ad8e7
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to