Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20067 )

Change subject: KUDU-3458 Startup tserver even if some metadata is corrupted
......................................................................


Patch Set 4:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc
File src/kudu/integration-tests/ts_tablet_manager-itest.cc:

http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1009
PS4, Line 1009: tablet
replica?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1009
PS4, Line 1009: table
tablet?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1010
PS4, Line 1010: rf=3
RF=3


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1010
PS4, Line 1010: recopied
It's called 're-replicated' in Kudu's terminology.


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1010
PS4, Line 1010: Tablet
Data


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1011
PS4, Line 1011: table
tablet?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1011
PS4, Line 1011: tablet
replicas?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1014
PS4, Line 1014: table
tablet


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1014
PS4, Line 1014: 1 of its tablets
the only replica it has


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1020
PS4, Line 1020:   const char* const kTableNameR3Failed3 = "test-table-r3-3f";
              :   const char* const kTableNameR3Failed1 = "test-table-r3-1f";
              :   const char* const kTableNameR1 = "test-table-r1";
              :   const int kExpectedTabletCountsBefore = 3 * 3 + 3 * 3 + 3 * 1;
              :   const int kExpectedTabletCountsAfter = 
kExpectedTabletCountsBefore - 4;
              :   const int kNumTabletServers = 3;
              :   const int kNumPartitions = 3;
              :   const int kRowsCount = 100;
nit: maybe, consider adding the constexpr specifier?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1029
PS4, Line 1029:   FLAGS_follower_unavailable_considered_failed_sec = 2;
Would be great to comment on why customizing these flags.


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1090
PS4, Line 1090:  std::
nit for here and elsewhere: could drop the 'std' prefix since there are 
corresponding 'using std::xxx' directives in the beginning of this file


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1094
PS4, Line 1094:   // Search for bucket 0 tablet_id
              :   {
              :     MiniTabletServer* mts = cluster_->mini_tablet_server(0);
              :     for (const auto& tablet_id : mts->ListTablets()) {
              :       scoped_refptr<kudu::tablet::TabletReplica> replica;
              :       
ASSERT_OK(mts->server()->tablet_manager()->GetTabletReplica(tablet_id, 
&replica));
              :       auto meta_data = replica->tablet_metadata();
              :       if (corrupt_1_tablet_id.empty() && 
meta_data->table_name() == kTableNameR3Failed1 &&
              :           meta_data->partition().hash_buckets()[0] == 0) {
              :         corrupt_1_tablet_id = tablet_id;
              :       }
              :       if (corrupt_3_tablet_id.empty() && 
meta_data->table_name() == kTableNameR3Failed3 &&
              :           meta_data->partition().hash_buckets()[0] == 0) {
              :         corrupt_3_tablet_id = tablet_id;
              :       }
              :     }
              :   }
You could get rid of this complexity by creating test tables with just a single 
tablet.


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1144
PS4, Line 1144: files_to_keep
nit: should this be const?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1159
PS4, Line 1159: Make the meta-files corrupt by writing some characters in the 
middle
What if a metadata file becomes empty or just absent?  Consider verifying the 
behavior of the bootstrapping code in that case as well.


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1166
PS4, Line 1166:     meta_file.flush();
nit: isn't the stream object going to flush automatically on destruction while 
calling close()?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1169
PS4, Line 1169: unchange
?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1186
PS4, Line 1186: 
SleepFor(MonoDelta::FromSeconds(FLAGS_follower_unavailable_considered_failed_sec));
Could this be a point for flakiness due to scheduler anomalies, etc.?  Did you 
try to run this new test scenario multiple times using dist-test?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.h@432
PS4, Line 432: wich
that


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc@942
PS4, Line 942:   // Elements only inserted at startup and only removed later.
             :   // So no need to lock for empty check.
But the elements are deleted just below, and it can be done in multiple 
threads.  The unordered_map::empty() method doesn't seem to be thread-safe in 
that regard, no?


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc@944
PS4, Line 944:   if (!tablets_with_corrupted_metadata_.empty()) {
In addition to details about locking, it would be great to comment on why it's 
necessary to perform such a clean up.


http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc@949
PS4, Line 949:     }
             :     if (need_cleanup) {
             :       CALLBACK_RETURN_NOT_OK(
             :           
Env::Default()->DeleteFile(fs_manager_->GetTabletMetadataPath(tablet_id)));
             :       
Env::Default()->DeleteFile(fs_manager_->GetConsensusMetadataPath(tablet_id));
             :     }
I'm not sure that's the right way to do this.  At least, there is  
TabletMetadata::DeleteSuperBlock() that is used to properly remove the tablet's 
metadata.  I guess a similar story is for remove the Raft consensus metadata.  
You could do more code spelunking to find out more on how to properly do this.



--
To view, visit http://gerrit.cloudera.org:8080/20067
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id44abaef5238b1de99350d8db43494a7120de608
Gerrit-Change-Number: 20067
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Martonka <zmarto...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 12 Jul 2023 02:00:24 +0000
Gerrit-HasComments: Yes

Reply via email to