Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: mvcc: allow tablet shutdown without completing txs
......................................................................


Patch Set 20:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc
File src/kudu/integration-tests/ts_tablet_manager-itest.cc:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@130
PS20, Line 130:     LOG(INFO) << "Closing replica's MVCC manager";
this should probably be less specific to implementation details


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@179
PS20, Line 179: TEST_F(TsTabletManagerITest, 
TestStoppedTabletsScansGetRedirected) {
I'm not sure this test suite is the best place for these tests, since they 
aren't really testing that much about the TsTabletManager. Maybe a new suite 
like tablet_shutdown-itest.cc ?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@183
PS20, Line 183:   NO_FATALS(StopTablet(tablet_id));
here you're stopping a random server, not necessarily the one that the 
scanner(s) are reading from, right? is it worth trying to explicitly stop the 
leader?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@187
PS20, Line 187:   reads.set_num_read_threads(1);
are the readers using fault-tolerant scans by default here?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@197
PS20, Line 197:   reads.StopAndJoin();
Somewhere we have a utility function like AssertNoCrashes(). Otherwise maybe 
the server you stopped might have hit a crash but you wouldn't notice because 
the reader just failed over to another replica


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@205
PS20, Line 205:   NO_FATALS(StartClusterAndLoadTablet(kNumReplicas, 
&tablet_id));
in these tests it might be beneficial to configure the MM such that it 
constantly flushes - eg set the flush threshold to 1mb? I think we have some 
tests elsewhere that do this.


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@235
PS20, Line 235:   writes.StopAndJoin();
is it possible in this test to re-start the tserver that has the stopped tablet 
to make sure that it can properly replay its WAL and catch back up to the other 
replicas? ie to make sure that we didn't get some divergent state?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/integration-tests/ts_tablet_manager-itest.cc@238
PS20, Line 238: TEST_F(TsTabletManagerITest, 
TestStoppedFollowerTabletsDontBlockWrites) {
can you combine this test with the above one, using a utility function that 
takes an enum or boolean indicating whether to stop the leader or follower?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc
File src/kudu/tablet/mvcc-test.cc:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc@498
PS20, Line 498: ASSERT_OK(s);
              :   waiting_thread.join();
don't you want to join before assert here?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc-test.cc@598
PS20, Line 598:   ASSERT_EVENTUALLY([&] {
              :     std::lock_guard<simple_spinlock> l(status_lock);
              :     ASSERT_STR_CONTAINS(s.ToString(), "closed");
              :     ASSERT_TRUE(s.IsAborted());
              :   });
              :   waiting_thread.join();
              :
if you reordered the join and the ASSERT then you could avoid the complexity of 
the lock and the ASSERT_EVENTUALLY


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@253
PS20, Line 253: an error
specify the type of status


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@254
PS20, Line 254:   Status WaitForApplyingTransactionsToCommit() const;
worth adding a WARN_UNUSED_RESULT here


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@275
PS20, Line 275: now
not?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/mvcc.h@280
PS20, Line 280: IsClosed
rename to is_closed since it's inlinable short method


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@160
PS20, Line 160:   Status ApplyRowOperations(WriteTransactionState* tx_state);
mind adding a WARN_UNUSED_RESULT to these functions that you're changing from 
void to Status returns? That way we'll notice if we missed updating any call 
sites


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@405
PS20, Line 405: should
nit: "should" is kinda fuzzy. Do you mean "will"? Nicer for callers to think 
about what guarantees this method provides and which things are more 
"best-effort".


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@405
PS20, Line 405:   // Stops the tablet. Once called, currently-Applying 
operations should
              :   // terminate early, and further transactions should be 
aborted.
              :   void Stop();
Would be good to move this declaration up next to 'Shutdown()' and clarify the 
comment to explain the difference between Stop and Shutdown. Do both have to be 
called? Which order? What's the difference?

Also should document whether this is synchronous or not (ie after this returns 
is there any guarantee that ops have stopped, etc?)


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.h@465
PS20, Line 465:   Status BulkCheckPresence(WriteTransactionState* tx_state);
add doc for what bad status means. Same above with the Apply calls.


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet.cc@806
PS20, Line 806:         CHECK(Stopped()) << "MVCC must be closed to survive 
transaction failures";
I think it's worth a comment here explaining why we expect it to already have 
marked the tablet failed. I guess some lower layer is already handling a 
synchronous notification which leads back to this tablet?


http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/7439/20/src/kudu/tablet/tablet_replica.cc@263
PS20, Line 263:   if (tablet_) {
is it ok to access tablet_ outside of state_change_lock_? can't quite recall 
the purpose of this lock but strange to see this outside of the lock with the 
rest of the stuff below inside it



--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 20
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Fri, 27 Oct 2017 20:14:06 +0000
Gerrit-HasComments: Yes

Reply via email to