Mike Percy has posted comments on this change. Change subject: disk failure: don't fail CHECKs for disk failures ......................................................................
Patch Set 2: (2 comments) Like I alluded in some comments, I think this patch needs to be split out into one patch per change. Each removal of a CHECK should include tests for all of the new cases that used to be impossible to hit without a crash. I also think it would be good to have an overall error handling design documented, either in a doc or a header file or something, so that we have a base layer of error handling set up and then systematically implement the new error handling in each crash case until we don't crash anymore and we know we safely handle each former crash case. http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: Line 32: DECLARE_bool(suicide_on_eio); unused? Line 137: LOG(ERROR) << Substitute("FlushMRS failed on $0", tablet_replica_->tablet_id()); Hrm. Is it safe to change a CHECK() to a LOG(ERROR) here? I would like to see tests for all the new possible error cases we are unwrapping here. Diving into FlushUnlocked(), it looks like we create a new MRS and then we call FlushInternal(). Can we fail to create a new MRS? In FlushInternal(), there is a CHECK based on state_ and I am not sure how that is enforced. I think this change by itself deserved its own patch. -- To view, visit http://gerrit.cloudera.org:8080/7442 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
