Zoltan Martonka has posted comments on this change. ( http://gerrit.cloudera.org:8080/20067 )
Change subject: [WIP] KUDU-3458 Startup tserver even if some metadata is corrupted ...................................................................... Patch Set 5: (11 comments) http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc File src/kudu/integration-tests/ts_tablet_manager-itest.cc: http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1009 PS4, Line 1009: pecta > tablet? Thank you, I will fix the table/tablet/replica mess-up. http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1010 PS4, Line 1010: s all 3 > It's called 're-replicated' in Kudu's terminology. Thanks http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1020 PS4, Line 1020: constexpr char kTableNameR3Failed1[] = "test-table-r3-1f"; : constexpr char kTableNameR1[] = "test-table-r1"; : constexpr int kExpectedTabletCountsBefore = 3 + 3 + 1; : constexpr int kExpectedTabletCountsAfter = kExpectedTabletCountsBefore - 4; : constexpr int kNumTabletServers = 3; : constexpr int kRowsCount = 100; : // If a tablet can not be re-replicated properly, then kudu should not touch the corrupted files. : // There is no good way to > nit: maybe, consider adding the constexpr specifier? ok http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1029 PS4, Line 1029: > Would be great to comment on why customizing these flags. Because they were customised in re-replication tests where a tserver disappears, and I wrongly thought the master will wait with re-replication for the same time if only a tablet is missing from the restarted tserver. It seem to notice the missing tablet instantly. I will remove the flag settings. http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1090 PS4, Line 1090: > nit for here and elsewhere: could drop the 'std' prefix since there are cor ok http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1094 PS4, Line 1094: ->fs_manager() : ->GetTabletMetadataPath(tablet_name); : }; : : create_table_with_some_data(kTableNameR3Failed1, 3); : create_table_with_some_data(kTableNameR3Failed3, 3); : create_table_with_some_data(kTableNameR1, 1); : : ASSERT_EVENTUALLY([&] { ASSERT_EQ(kExpectedTabletCountsBefore, all_tablet_count()); }); : : ASSERT_OK(cluster_->mini_master()->WaitForCatalogManagerInit()); : : string corrupt_1_tablet_id = get_tablet_id(kTableNameR3Failed1); : string corrupt_3_tablet_id = get_tablet_id(kTableNameR3Failed3); : string corrupt_unreplicated_tablet_id = get_tablet_id(kTableNameR1); : : A > You could get rid of this complexity by creating test tables with just a si ok http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1144 PS4, Line 1144: > nit: should this be const? ok http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1166 PS4, Line 1166: for (int i = 0; i < > nit: isn't the stream object going to flush automatically on destruction wh yes it will. http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/integration-tests/ts_tablet_manager-itest.cc@1186 PS4, Line 1186: std::istreambuf_iterator<char>(f), > Could this be a point for flakiness due to scheduler anomalies, etc.? Did We wait for re-replication in the previous line. ASSERT_EVENTUALY has a 30 sec timeout by default, which should be sufficient. I wanted to test if tserver leaves the other unrestorable metadata files untouched. I don't know if there is any better method to test if something does not happen. It can't be flaky, because if the logic is right then it will always pass, and if the logic is wrong, then we might get a false PASSED for not waiting enough for the server to do what it should not. I run it 100 times using dist-test, it seems to be stable. http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc@942 PS4, Line 942: // * When tablet copy is done the tablet still needs to bootstrap which will : // take some time. > But the elements are deleted just below, and it can be done in multiple thr No, it is not. I will fix this. http://gerrit.cloudera.org:8080/#/c/20067/4/src/kudu/tserver/ts_tablet_manager.cc@944 PS4, Line 944: // > In addition to details about locking, it would be great to comment on why i ok -- To view, visit http://gerrit.cloudera.org:8080/20067 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id44abaef5238b1de99350d8db43494a7120de60 Gerrit-Change-Number: 20067 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Martonka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Fri, 21 Jul 2023 12:24:44 +0000 Gerrit-HasComments: Yes
