Andrew Wong has posted comments on this change.

Change subject: disk failure: reassign failed tablets
......................................................................

Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc
File src/kudu/client/scanner-internal.cc:

PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through
> would it make more sense to have this be like: TABLET_NOT_FOUND? How do we
Hrm, maybe, but I'm keeping this as is for now. Reasoning here was that before, when a tablet was in the FAILED state, we would treat it as TABLET_NOT_RUNNING.

I'm looking in client/scanner-internal.cc and it seems like we blacklist the location for TNR (if there's somewhere else I should be looking, please let me know). I'm not sure it makes sense to retry on TNR. I suppose it could retry if the tablet were NOT_STARTED or BOOTSTRAPPING, but tablets in QUIESCING and SHUTDOWN are also considered NOT_RUNNING.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 284: sponse_.error().code() == TabletServerErrorPB::TABLET_FAILED)
> maybe in this case we should directly call: NotifyObserversOfFailedFollower
Done.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS7, Line 638: // Initiate Tablet Copy on the peer if the tablet is not found.
             : if (response.has_error()) {
             :   CHECK_EQ(tserver::TabletServerErrorPB::TABLET_NOT_FOUND, response.error().code());
             :   peer->needs_tablet_copy = true;
             :   VLOG_WITH_PREFIX_UNLOCKED(1) << "Marked peer as needing tablet copy: "
             :                                << peer->ToString();
             :   *more_pending = true;
             :   return;
             : }
             :
             : // Sanity checks.
             : // Some of these can be eventually removed, but they are handy for now.
             : DCHECK(response.status().IsInitialized())
             :     << "Error: Uninitialized: " << response.InitializationErrorString()
             :     << ". Response: "<< SecureShortDebugString(response);
             : // TODO: Include uuid in error messages as well.
             : DCHECK(response.has_responder_uuid() && !response.responder_uuid().empty())
             :
> see my comment on the call site
Done


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS7, Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
             :             "Whether the master should tombstone (delete) tablet replicas that "
             :             "are reporting a failed state. Only for testing!");
             : TAG_FLAG(master_tombstone_failed_tablet_replicas, hidden);
> is this a test only thing?
As of now, yes. Will update to make that clear.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

PS7, Line 161: the tablet will be evicted and
> ??
Should be evicted and replaced.


--
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
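
Illustrative note on the scanner-internal.cc reply above: the distinction it draws between tablet states that might be worth retrying (NOT_STARTED, BOOTSTRAPPING) and states where moving to another replica seems safer (FAILED, QUIESCING, SHUTDOWN) could be sketched roughly as follows. This is only a sketch of that reasoning, not Kudu's client code: `TabletState`, `ScanAction`, and `ClassifyForScan` are hypothetical names made up for illustration, and per the reply the client currently blacklists the location on TABLET_NOT_RUNNING regardless of the underlying state.

```cpp
#include <cassert>

// Hypothetical stand-in for the tablet states named in the review.
// This is NOT Kudu's actual enum.
enum class TabletState {
  kNotStarted,
  kBootstrapping,
  kRunning,
  kFailed,
  kQuiescing,
  kShutdown
};

// Hypothetical client-side reaction to a scan attempt against a replica.
enum class ScanAction {
  kProceed,           // replica is serving; no error handling needed
  kRetrySameReplica,  // replica may start serving soon
  kBlacklistReplica   // replica will not serve; try another location
};

// Maps a replica's state to a scan action, following the reasoning in the
// reply: only states that lead to RUNNING are candidates for a retry.
ScanAction ClassifyForScan(TabletState state) {
  switch (state) {
    case TabletState::kRunning:
      return ScanAction::kProceed;
    case TabletState::kNotStarted:
    case TabletState::kBootstrapping:
      return ScanAction::kRetrySameReplica;
    case TabletState::kFailed:     // fall-through, as in the patch
    case TabletState::kQuiescing:  // fall-through
    case TabletState::kShutdown:
      return ScanAction::kBlacklistReplica;
  }
  // Unreachable for valid enum values; be conservative otherwise.
  assert(false && "unhandled TabletState");
  return ScanAction::kBlacklistReplica;
}
```

Because a single TABLET_NOT_RUNNING error code covers all of the non-running states, a client would need the underlying state (or a more specific code such as TABLET_FAILED) to make this kind of split.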
