Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14222 )
Change subject: KUDU-2069 p4: stop replication from failed servers in maintenance mode ...................................................................... Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc File src/kudu/integration-tests/maintenance_mode-itest.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@99 PS2, Line 99: Status GetNumBytesCopied(int64_t* num_bytes_copied) { This is a proxy for the question "did we rereplicate anything", right? Would it be more intuitive if we looked at the count of the server entity's METRIC_handler_latency_kudu_tserver_TabletCopyService_BeginTabletCopySession instead? Also, you only ever compare the value to 0, so maybe convert this into a yes/no function? http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@151 PS2, Line 151: protected: Don't need this? http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@208 PS2, Line 208: int64_t num_bytes_copied = 0; Nit: don't need to initialize this. http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc@4190 PS2, Line 4190: unordered_set<string> whitelisted_uuids; : master_->ts_manager()->GetWhitelistedUuids(&whitelisted_uuids); It's wasteful to generate this for each tablet in the report. http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h File src/kudu/master/ts_manager.h: http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h@94 PS2, Line 94: void GetWhitelistedUuids(std::unordered_set<std::string>* uuids) const; "Whitelisted" doesn't really capture the expected semantics of these UUIDs. Another thing I noticed is that there's potentially a big difference in cardinality between the returned set (number of tservers in maintenance mode) and the number of iterations in ShouldAddReplica (replication factor). Taken together, perhaps we shouldn't generate this set up front, and instead pass the TSManager in some fashion into ShouldAddReplica? The consensus -> master dependency means we shouldn't pass in TSManager as-is, but maybe we could pass it via a lambda that takes a UUID and returns true if the tserver should be ignored for underreplication purposes? -- To view, visit http://gerrit.cloudera.org:8080/14222 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db Gerrit-Change-Number: 14222 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Fri, 13 Sep 2019 04:24:17 +0000 Gerrit-HasComments: Yes
