Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 19:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269
PS18, Line 269:     ASSERT_LT(0, iterations);
              :   });
              : }
              :
> If there's a discrepancy between the number of tservers the TSManager belie
>If there's a discrepancy between the number of tservers the TSManager believes 
>are available and the number of tservers the CatalogManager believes are up 
>with tables on them, then to be safe, the auto-rebalancer just exits the 
>current iteration and sleeps (exiting BuildClusterRawInfo()), and waits to see 
>if next cycle it will be consistent.

Right. My point is, can we test for that? Knowing how it works today is great, 
but as bits of implementation change here and there over time, having a test 
that will let us know if the user-facing behavior regresses is preferred to 
watching over the code via code review.

For instance, we might expect if we start with 3 tservers with replicas and one 
goes down, there should be no rebalancing at all because we'll never become 
healthy with just two tservers. If we start 4 tservers with replicas, with a 
low --tserver_unresponsive_timeout_ms and one goes down, we'd expect first for 
the replicas to recover, and then for the cluster to be balanced (maybe just 
test that eventually the cluster will look healthy and balanced?).

FWIW I don't think it's a blocker, but if the test is easy enough to write, 
it'd be great to get that in. Or maybe add it as a follow-up patch.


http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc@391
PS19, Line 391:   for (const auto& r : locs_pb.interned_replicas()) {
              :     if (r.role() == RaftPeerPB::LEADER) {
              :       int index = r.ts_info_idx();
              :       const TSInfoPB ts_info = 
*(ts_infos_dict.ts_info_pbs[index]);
              :       *leader_uuid = ts_info.permanent_uuid();
              :       const auto& addr = ts_info.rpc_addresses(0);
              :       leader_hp->set_host(addr.host());
              :       leader_hp->set_port(addr.port());
              :       break;
              :     }
              :   }
              :   return Status::OK();
Per the comment methods:

  // Gets information on the current leader replica for the specified tablet and
  // populates the 'leader_uuid' and 'leader_hp' output parameters. The
  // function returns Status::NotFound() if no replica is a leader for the 
tablet.

This should probably be something along the lines of:

for (r in replicas) {
  if (r is leader) {
    set leader properties
    return Status::OK();
  }
}
return Status::NotFound("couldn't find leader for <tablet_id>");

or something similar.


http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc@426
PS19, Line 426:       return Status::NotFound("Could not find leader replica's 
tserver");
nit: would be nice to include the leader's UUID in this error message



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 19
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 09 Mar 2020 21:34:58 +0000
Gerrit-HasComments: Yes

Reply via email to