Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14223 )
Change subject: KUDU-2069 p5: recheck tablet health when exiting maintenance mode ...................................................................... Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/14223/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14223/2//COMMIT_MSG@33 PS2, Line 33: While somewhat hamfisted, on a reasonably dense cluster, it doesn't seem : unlikely that every tserver might be hosting a leader. I don't think this is hamfisted at all once the enter/exit MM operations work on a batch of tservers. http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/catalog_manager.cc@4989 PS2, Line 4989: Status CatalogManager::SetTServerState(const string& tserver_uuid, TServerState state) { Might be a little clearer if SetAllTServers... were called instead in master_service.cc when this function succeeds. That way all of the management of the new in-memory state is in just one place. http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/catalog_manager.cc@5003 PS2, Line 5003: ts_manager->SetAllTServersRequireFullTabletReport(); > Could it be that the tserver already exited maintenance mode? Wouldn't we have short-circuited out via L4997 then? http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/ts_manager.cc File src/kudu/master/ts_manager.cc: http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/ts_manager.cc@234 PS2, Line 234: void TSManager::SetAllTServersRequireFullTabletReport() { Needs to acquire lock_ in shared mode, no? Surprised TSAN didn't catch this; do we have no multi-threaded test coverage including this function and another that takes lock_? Could you update the comments in ts_manager.h to make it more clear when lock_ needs to be acquired? http://gerrit.cloudera.org:8080/#/c/14223/2/src/kudu/master/ts_manager.cc@244 PS2, Line 244: for (auto& id_and_desc : servers_by_id_) { : id_and_desc.second->UpdateNeedsFullTabletReport(false); : } Why does this belong here? It's not intuitive since this isn't part of the "tserver state". -- To view, visit http://gerrit.cloudera.org:8080/14223 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic0ab3d78cbc5b1228c01592a00118f11f76e43dd Gerrit-Change-Number: 14223 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Fri, 13 Sep 2019 04:34:32 +0000 Gerrit-HasComments: Yes
