Gabriella Lotz has posted comments on this change. ( http://gerrit.cloudera.org:8080/23755 )
Change subject: auto_rebalancer: enforce max_moves_per_server per tserver ...................................................................... Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@a411 PS3, Line 411: : > I see this TODO is removed in PS3. Has this been addressed as of PS3 of th No, this TODO has not been addressed, it was my error deleting it. I've added it back. http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@267 PS3, Line 267: } while (!replica_moves.empty()); > Do we want to add a few DCHECK() assertions on the state of the newly intro Done http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@336 PS3, Line 336: > Since accessing FLAGS_auto_rebalancing_max_moves_per_server has extra costs Done http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@337 PS3, Line 337: Status AutoRebalancerTask::GetMovesUsingRebalancingAlgo( : const ClusterRawInfo& raw_info, : rebalance::RebalancingAlgo* algo, : C > Please make sure 'num_tservers' and 'max_moves' end up being signed integer Done http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@367 PS3, Line 367: RETURN_NOT_OK(algo->GetNextMoves(cluster_info, max_moves, &moves)); : > I'd think of comparing (src_ongoing + dst_ongoing) against the limit set by Could you clarify what you mean by comparing (src_ongoing + dst_ongoing) against the limit? My understanding is that src_ongoing is the count for the source tserver and dst_ongoing is for the destination tserver. Adding them would give the combined count across both servers, which doesn't match the flag semantics of limiting each individual server. Or are you suggesting we should track the total operations on a server (as both source and destination combined)? If so, I think that's already what moves_per_tserver_ does, it increments for both source and destination when a move is scheduled. http://gerrit.cloudera.org:8080/#/c/23755/3/src/kudu/master/auto_rebalancer.cc@723 PS3, Line 723: replica_moves->erase(replica_moves->begi > Why would we have zeros there if the move that has been scheduled should ha The counter can legitimately be zero if ExecuteMoves() failed before processing this move. ExecuteMoves() uses RETURN_NOT_OK which exits early on error, leaving later moves in replica_moves with zero counters. The if() check handles this case. I've added a comment to clarify. An alternative would be refactoring ExecuteMoves() to remove failed moves from replica_moves, but that's a larger change I'd prefer to defer to keep this patch focused. -- To view, visit http://gerrit.cloudera.org:8080/23755 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic1b73045719dc69cc9e3353d318b7381f144a20d Gerrit-Change-Number: 23755 Gerrit-PatchSet: 4 Gerrit-Owner: Gabriella Lotz <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Gabriella Lotz <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Fri, 09 Jan 2026 21:55:35 +0000 Gerrit-HasComments: Yes
