Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14300 )
Change subject: KUDU-2800 Test long bootstrapping tablet replicas ...................................................................... Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/14300/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14300/10//COMMIT_MSG@11 PS10, Line 11: : 1) Tablet replica bootstraps long time. No data modications happen during these time. Consensus state does not change. : : 2) Tablet replica is shutdown for some time. During this time a lot of data modifications happen, log rolls, replica is replaced (with re-replication). Replica restarts and find out the it fell behind. : : 3) Tablet replica bootstraps long time. During this time a lot of data modifications happen, log rolls, replica is replaced (with re-replication). Replica restarts and find out the it fell behind. nit: could you wrap these lines at 72 characters? per #6 of https://chris.beams.io/posts/git-commit/ http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc File src/kudu/integration-tests/raft_config_change-itest.cc: http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@464 PS10, Line 464: ValidateConsensusState nit: maybe rename this so it's clearer from the callsite what it's trying to achieve? Something like ValidateReplicaMoved() or something? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@484 PS10, Line 484: : FLAGS_num_tablet_servers = 4; : FLAGS_num_replicas = 3; : NO_FATALS(BuildAndStart( : {"--follower_unavailable_considered_failed_sec=3"}, : {"--tserver_unresponsive_timeout_ms=3000"})); : : ASSERT_EQ(4, tablet_servers_.size()); : : // Extra sanity checks. : vector<string> replica_servers = GetServersWithReplica(tablet_id_); : ASSERT_EQ(3, replica_servers.size()); : vector<string> no_replica_servers = GetServersWithoutReplica(tablet_id_); : ASSERT_EQ(1, no_replica_servers.size()); nit: is it important for the flags here to be different than the flags in the cases below? If not, maybe consider encapsulating in a function along the lines of SetupClusterForUnavailability() and reusing it in the below tests? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@508 PS10, Line 508: ASSERT_EVENTUALLY([&] { Is this ASSERT_EVENTUALLY necessary? If the replicas have been already moved in the first place, won't this never converge? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@516 PS10, Line 516: // Check that the replica at the tablet servers that was restarted is not removed. : ASSERT_TRUE(IsRaftConfigMember(replica_servers.front(), cstate.committed_config())); Should this check be done on all 3 of the replica_servers? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@597 PS10, Line 597: replica->uuid() Isn't this just follower_uuid? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@627 PS10, Line 627: : LOG(INFO) << "Waiting for log GC on " << leader->uuid(); : // Some WAL segments must exist, but wal segment 1 must not exist. : ASSERT_OK(inspect_->WaitForFilePatternInTabletWalDirOnTs( : leader_index, tablet_id_, { "wal-" }, { "wal-000000001" })); : LOG(INFO) << "Log GC complete on " << leader->uuid(); How long does it take for this test to finish? Would it make sense to flush more frequently via --flush_threshold_secs so the WALs are not anchored for as long? http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/integration-tests/raft_config_change-itest.cc@635 PS10, Line 635: SleepFor(MonoDelta::FromSeconds(3)); nit: this probably isn't needed, right? Since we're going to retry the below check for up to 30 seconds http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/14300/10/src/kudu/tserver/ts_tablet_manager.cc@1077 PS10, Line 1077: << "Injecting " << FLAGS_tablet_open_bootstrap_inject_latency_ms : << "ms delay in tablet bootstrapping"; nit: just FYI Substitute could be used here, e.g. LOG(WARNING) << Substitute("Injecting $0 ms delay in in tablet bootstrapping", FLAGS_tablet_open_bootstrap_inject_latency_ms); -- To view, visit http://gerrit.cloudera.org:8080/14300 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie1fee42053194f51d7a869ce14788095d6627ed9 Gerrit-Change-Number: 14300 Gerrit-PatchSet: 10 Gerrit-Owner: Volodymyr Verovkin <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Greg Solovyev <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Volodymyr Verovkin <[email protected]> Gerrit-Comment-Date: Fri, 04 Oct 2019 00:54:50 +0000 Gerrit-HasComments: Yes
