Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 )
Change subject: mvcc: allow tablet shutdown without completing txs ...................................................................... Patch Set 20: (20 comments) http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc File src/kudu/integration-tests/ts_tablet_manager-itest.cc: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@130 PS20, Line 130: LOG(INFO) << "Closing replica's MVCC manager"; this should probably be less specific to implementation details http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@179 PS20, Line 179: TEST_F(TsTabletManagerITest, TestStoppedTabletsScansGetRedirected) { I'm not sure this test suite is the best place for these tests, since they aren't really testing that much about the TsTabletManager. Maybe a new suite like tablet_shutdown-itest.cc ? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@183 PS20, Line 183: NO_FATALS(StopTablet(tablet_id)); here you're stopping a random server, not necessarily the one that the scanner(s) are reading from, right? is it worth trying to explicitly stop the leader? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@187 PS20, Line 187: reads.set_num_read_threads(1); are the readers using fault-tolerant scans by default here? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@197 PS20, Line 197: reads.StopAndJoin(); Somewhere we have a utility function like AssertNoCrashes(). Otherwise maybe the server you stopped might have hit a crash but you wouldn't notice because the reader just failed over to another replica http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@205 PS20, Line 205: NO_FATALS(StartClusterAndLoadTablet(kNumReplicas, &tablet_id)); in these tests it might be beneficial to configure the MM such that it constantly flushes - eg set the flush threshold to 1mb? I think we have some tests elsewhere that do this. http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@235 PS20, Line 235: writes.StopAndJoin(); is it possible in this test to re-start the tserver that has the stopped tablet to make sure that it can properly replay its WAL and catch back up to the other replicas? ie to make sure that we didn't get some divergent state? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@238 PS20, Line 238: TEST_F(TsTabletManagerITest, TestStoppedFollowerTabletsDontBlockWrites) { can you combine this test with the above one, using a utility function that takes an enum or boolean indicating whether to stop the leader or follower? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc File src/kudu/tablet/mvcc-test.cc: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc@498 PS20, Line 498: ASSERT_OK(s); : waiting_thread.join(); don't you want to join before assert here? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc@598 PS20, Line 598: ASSERT_EVENTUALLY([&] { : std::lock_guard<simple_spinlock> l(status_lock); : ASSERT_STR_CONTAINS(s.ToString(), "closed"); : ASSERT_TRUE(s.IsAborted()); : }); : waiting_thread.join(); : if you reordered the join and the ASSERT then you could avoid the complexity of the lock and the ASSERT_EVENTUALLY http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h File src/kudu/tablet/mvcc.h: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@253 PS20, Line 253: an error specify the type of status http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@254 PS20, Line 254: Status WaitForApplyingTransactionsToCommit() const; worth adding a WARN_UNUSED_RESULT here http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@275 PS20, Line 275: now not? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@280 PS20, Line 280: IsClosed rename to is_closed since it's inlinable short method http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h File src/kudu/tablet/tablet.h: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@160 PS20, Line 160: Status ApplyRowOperations(WriteTransactionState* tx_state); mind adding a WARN_UNUSED_RESULT to these functions that you're changing from void to Status returns? That way we'll notice if we missed updating any call sites http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@405 PS20, Line 405: should nit: "should" is kinda fuzzy. Do you mean "will"? Nicer for callers to think about what guarantees this method provides and which things are more "best-effort". http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@405 PS20, Line 405: // Stops the tablet. Once called, currently-Applying operations should : // terminate early, and further transactions should be aborted. : void Stop(); Would be good to move this declaration up next to 'Shutdown()' and clarify the comment to explain the difference between Stop and Shutdown. Do both have to be called? Which order? What's the difference? Also should document whether this is synchronous or not (ie after this returns is there any guarantee that ops have stopped, etc?) http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@465 PS20, Line 465: Status BulkCheckPresence(WriteTransactionState* tx_state); add doc for what bad status means. Same above with the Apply calls. http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.cc File src/kudu/tablet/tablet.cc: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.cc@806 PS20, Line 806: CHECK(Stopped()) << "MVCC must be closed to survive transaction failures"; I think it's worth a comment here explaining why we expect it to already have marked the tablet failed. I guess some lower layer is already handling a synchronous notification which leads back to this tablet? http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet_replica.cc@263 PS20, Line 263: if (tablet_) { is it ok to access tablet_ outside of state_change_lock_? can't quite recall the purpose of this lock but strange to see this outside of the lock with the rest of the stuff below inside it -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 20 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Fri, 27 Oct 2017 20:14:06 +0000 Gerrit-HasComments: Yes
