Todd Lipcon has posted comments on this change.

Change subject: disk failure: release failed txs from tracker

Patch Set 8:

File src/kudu/tablet/mvcc.h:

Line 404:   void Cancel();
here's an idea: how about rather than specifically cancelling each transaction, 
there is an MvccManager::Shutdown() call, and then we allow AbortTransaction() 
of an APPLYING transaction only in the case that the MvccManager has entered 
shutdown state?

then we don't need new per-transaction states, and it more accurately captures 
the idea that "you can only cancel operations once the entire tablet ahs 
started shutting itself down"

(and we can still ensure that, on destruction, the map has been cleared out)
File src/kudu/tablet/transactions/

Line 481:       CHECK(s.IsDiskFailure());
per conversation on Slack, I think generalizing this to something like the 
following would separate the concerns a bit better:

- on a disk error, mark the tablet as shutdown
- when attempting to apply operations on a shutdown tablet, return 
- if we get Status::Aborted here, we can CHECK that the tablet is in Shutdown() 
state and then go to the cancellation path
-- basically this would be the same code as in HandleFailure today, except 
widened the scope to allow cancelling a replicating/replicated transaction in 
the case that the tablet is in failed state? (eg by taking a new parameter to 
HandleFailure indicating that it's due to the tablet shutdown?)
File src/kudu/tablet/transactions/

Line 134:   if (PREDICT_FALSE(tablet->IsInFailedDir())) {
making ApplyRowOperations return a Status would allow centralizing this logic 
into Tablet a bit better and also generalizing to all "Shutdown" cases instead 
of just specifically the disk-failure one

Line 139:   tablet->ApplyRowOperations(state());
here if Tablet indicated that it has shut down then you can early-return 
without creating a commit message. that would be a pretty good safety net that 
we don't accidentally end up writing a commit message
File src/kudu/tserver/

Line 1076:     tablet->SetInFailedDir();
again maybe generalize to just 'MarkFailed' since the handling isn't particular 
to the reason for the failure. (eg if we detected some internal inconsistency 
like a bad checksum or something we may also want to markfailed even though 
it's not in a failed dir)

Line 1078:   LOG(FATAL) << Substitute("Data directory $0 failed. Disk failure 
handling not implemented",
now implemented?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <>
Gerrit-Reviewer: Andrew Wong <>
Gerrit-Reviewer: David Ribeiro Alves <>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-HasComments: Yes

Reply via email to