David Ribeiro Alves has posted comments on this change. Change subject: disk failure: don't open tablets on failed disks ......................................................................
Patch Set 3: (9 comments) http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: PS3, Line 581: // Don't report failed directories. shoudln't we _report_ failed directories? ok if you want to do it in a follow up but this is likely one of the most important things we should report PS3, Line 693: VLOG(3) << Substitute("Block $0 is in a failed directory; not deleting", : block_id.ToString()); how about using LOG_EVERY_N at INFO level, that way we'd get something in the logs but it woudn't spam http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: PS3, Line 1700: VLOG(3) << Substitute("Block $0 is in a failed directory; not deleting", : block_id.ToString()); same also do we have checks for the other ops? why is delete particular in this regard? http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: PS3, Line 56: if (GetParam() == "file") { : flags.emplace_back("--block_manager=file"); : } else { : flags.emplace_back("--block_manager=log"); : } maybe: flags.emplace_back(Substitute("--block_manager=$0", GetParam())); PS3, Line 65: void SetupDefaultCluster(vector<ExternalTabletServer*>* ext_tservers) { : FLAGS_num_tablet_servers = 3; : vector<string> flags = GetClusterFlags(); : NO_FATALS(CreateCluster("survivable_cluster", flags, {}, /* num_data_dirs */ 2)); : NO_FATALS(CreateClient(&client_)); : CHECK_EQ(FLAGS_num_tablet_servers, tablet_servers_.size()); : ext_tservers->clear(); : for (const auto& e : tablet_servers_) { : ext_tservers->push_back(cluster_->tablet_server_by_uuid(e.second->uuid())); : } : } can you add more disks? 2 is kind of special-casy because of tablet overlaps right? PS3, Line 124: NO_FATALS(SetServerSurvivalFlags(ext_tservers)); why is this not set on boot? PS3, Line 129: Substitute("--block_manager=$0", GetParam() why are you re-settting the block manager? can it change? PS3, Line 138: ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0), : kNumTablets, kAgreementTimeout)); so remind me again what happened to the failed tablets? were they replicated? where? PS3, Line 137: // Ensure again that the tablets get to a running, consistent state. : ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0), : kNumTablets, kAgreementTimeout)); : ASSERT_OK(WaitForServersToAgree(kAgreementTimeout, tablet_servers_, tablet_id, 1)); not sure I get this. you inject a failure, wait for it to happen and then make sure all tablets are started? and why are they agreeing on index 1, isn't that kind of a given even with the failures? -- To view, visit http://gerrit.cloudera.org:8080/7766 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes