Todd Lipcon has posted comments on this change. Change subject: disk failure: release failed txs from tracker ......................................................................
Patch Set 8: (6 comments) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/mvcc.h File src/kudu/tablet/mvcc.h: Line 404: void Cancel(); here's an idea: how about rather than specifically cancelling each transaction, there is an MvccManager::Shutdown() call, and then we allow AbortTransaction() of an APPLYING transaction only in the case that the MvccManager has entered shutdown state? then we don't need new per-transaction states, and it more accurately captures the idea that "you can only cancel operations once the entire tablet ahs started shutting itself down" (and we can still ensure that, on destruction, the map has been cleared out) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: Line 481: CHECK(s.IsDiskFailure()); per conversation on Slack, I think generalizing this to something like the following would separate the concerns a bit better: - on a disk error, mark the tablet as shutdown - when attempting to apply operations on a shutdown tablet, return Status::Aborted - if we get Status::Aborted here, we can CHECK that the tablet is in Shutdown() state and then go to the cancellation path -- basically this would be the same code as in HandleFailure today, except widened the scope to allow cancelling a replicating/replicated transaction in the case that the tablet is in failed state? (eg by taking a new parameter to HandleFailure indicating that it's due to the tablet shutdown?) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/write_transaction.cc File src/kudu/tablet/transactions/write_transaction.cc: Line 134: if (PREDICT_FALSE(tablet->IsInFailedDir())) { making ApplyRowOperations return a Status would allow centralizing this logic into Tablet a bit better and also generalizing to all "Shutdown" cases instead of just specifically the disk-failure one Line 139: tablet->ApplyRowOperations(state()); here if Tablet indicated that it has shut down then you can early-return without creating a commit message. that would be a pretty good safety net that we don't accidentally end up writing a commit message http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1076: tablet->SetInFailedDir(); again maybe generalize to just 'MarkFailed' since the handling isn't particular to the reason for the failure. (eg if we detected some internal inconsistency like a bad checksum or something we may also want to markfailed even though it's not in a failed dir) Line 1078: LOG(FATAL) << Substitute("Data directory $0 failed. Disk failure handling not implemented", now implemented? -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
