David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
......................................................................


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 581: // Don't report failed directories.
shoudln't we _report_ failed directories? ok if you want to do it in a follow 
up but this is likely one of the most important things we should report


PS3, Line 693: VLOG(3) << Substitute("Block $0 is in a failed directory; not 
deleting",
             :                             block_id.ToString());
how about using LOG_EVERY_N at INFO level, that way we'd get something in the 
logs but it woudn't spam


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS3, Line 1700:       VLOG(3) << Substitute("Block $0 is in a failed directory; 
not deleting",
              :                             block_id.ToString());
same
also do we have checks for the other ops? why is delete particular in this 
regard?


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 56:     if (GetParam() == "file") {
            :       flags.emplace_back("--block_manager=file");
            :     } else {
            :       flags.emplace_back("--block_manager=log");
            :     }
maybe: flags.emplace_back(Substitute("--block_manager=$0", GetParam()));


PS3, Line 65:   void SetupDefaultCluster(vector<ExternalTabletServer*>* 
ext_tservers) {
            :     FLAGS_num_tablet_servers = 3;
            :     vector<string> flags = GetClusterFlags();
            :     NO_FATALS(CreateCluster("survivable_cluster", flags, {}, /* 
num_data_dirs */ 2));
            :     NO_FATALS(CreateClient(&client_));
            :     CHECK_EQ(FLAGS_num_tablet_servers, tablet_servers_.size());
            :     ext_tservers->clear();
            :     for (const auto& e : tablet_servers_) {
            :       
ext_tservers->push_back(cluster_->tablet_server_by_uuid(e.second->uuid()));
            :     }
            :   }
can you add more disks? 2 is kind of special-casy because of tablet overlaps 
right?


PS3, Line 124:   NO_FATALS(SetServerSurvivalFlags(ext_tservers));
why is this not set on boot?


PS3, Line 129: Substitute("--block_manager=$0", GetParam()
why are you re-settting the block manager? can it change?


PS3, Line 138:   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
             :                                             kNumTablets, 
kAgreementTimeout));
so remind me again what happened to the failed tablets? were they replicated? 
where?


PS3, Line 137:   // Ensure again that the tablets get to a running, consistent 
state.
             :   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
             :                                             kNumTablets, 
kAgreementTimeout));
             :   ASSERT_OK(WaitForServersToAgree(kAgreementTimeout, 
tablet_servers_, tablet_id, 1));
not sure I get this. you inject a failure, wait for it to happen and then make 
sure all tablets are started? and why are they agreeing on index 1, isn't that 
kind of a given even with the failures?


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to