Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................

Patch Set 19:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269
PS18, Line 269:     ASSERT_LT(0, iterations);
              :   });
              : }
              :
> If there's a discrepancy between the number of tservers the TSManager belie
> If there's a discrepancy between the number of tservers the TSManager believes
> are available and the number of tservers the CatalogManager believes are up
> with tables on them, then to be safe, the auto-rebalancer just exits the
> current iteration and sleeps (exiting BuildClusterRawInfo()), and waits to see
> if next cycle it will be consistent.

Right. My point is: can we test for that? Knowing how it works today is great,
but as bits of the implementation change here and there over time, having a
test that will let us know if the user-facing behavior regresses is preferable
to watching over the code via code review.

For instance, we might expect that if we start with 3 tservers with replicas
and one goes down, there should be no rebalancing at all, because we'll never
become healthy with just two tservers. If we start with 4 tservers with
replicas, with a low --tserver_unresponsive_timeout_ms, and one goes down,
we'd expect the replicas to recover first, and then for the cluster to be
balanced (maybe just test that eventually the cluster will look healthy and
balanced?).

FWIW I don't think this is a blocker, but if the test is easy enough to write,
it'd be great to get it in. Or maybe add it as a follow-up patch.
http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc@391
PS19, Line 391:   for (const auto& r : locs_pb.interned_replicas()) {
              :     if (r.role() == RaftPeerPB::LEADER) {
              :       int index = r.ts_info_idx();
              :       const TSInfoPB ts_info = *(ts_infos_dict.ts_info_pbs[index]);
              :       *leader_uuid = ts_info.permanent_uuid();
              :       const auto& addr = ts_info.rpc_addresses(0);
              :       leader_hp->set_host(addr.host());
              :       leader_hp->set_port(addr.port());
              :       break;
              :     }
              :   }
              :   return Status::OK();

Per the method's comment:

// Gets information on the current leader replica for the specified tablet and
// populates the 'leader_uuid' and 'leader_hp' output parameters. The function
// returns Status::NotFound() if no replica is a leader for the tablet.

this should probably be something along the lines of:

for (r in replicas) {
  if (r is leader) {
    set leader properties
    return Status::OK();
  }
}
return Status::NotFound("couldn't find leader for <tablet_id>");

or something similar.

http://gerrit.cloudera.org:8080/#/c/14177/19/src/kudu/master/auto_rebalancer.cc@426
PS19, Line 426: return Status::NotFound("Could not find leader replica's tserver");
nit: it would be nice to include the leader's UUID in this error message.

--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 19
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 09 Mar 2020 21:34:58 +0000
Gerrit-HasComments: Yes
