David Ribeiro Alves has posted comments on this change. Change subject: disk failure: reassign failed tablets ......................................................................
Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through would it make more sense to have this be like: TABLET_NOT_FOUND? How do we respond to TABLET_NOT_RUNNING on the client? do we retry? my fear is that we might eventually do that (or might even do that already in some cases like single replica tablets) but this is a non-retriable error. Alternatively actually sending a tablet failed reply might make sense too. http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS7, Line 284: response_.error().code() != TabletServerErrorPB::TABLET_FAILED maybe in this case we should directly call: NotifyObserversOfFailedFollower() or a new analog public method (like queue_->NotifyPeerHasFailed())? http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS7, Line 638: // We only let special types of errors through to this point from the peer. : if (response.has_error()) { : // Initiate Tablet Copy on the peer if the tablet is not found. : if (tserver::TabletServerErrorPB::TABLET_NOT_FOUND == response.error().code()) { : peer->needs_tablet_copy = true; : VLOG_WITH_PREFIX_UNLOCKED(1) << "Marked peer as needing tablet copy: " : << peer->ToString(); : *more_pending = true; : } else { : // Ignore the response from the peer if it is failed. : CHECK(tserver::TabletServerErrorPB::TABLET_FAILED == response.error().code()) : << SecureShortDebugString(response); : VLOG_WITH_PREFIX_UNLOCKED(1) << "Ignoring response from tablet to promote eviction: " : << peer->ToString(); : *more_pending = false; : } : return; : } see my comment on the call site http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS7, Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true, : "Whether the master should tombstone (delete) tablet replicas that " : "are reporting a failed state."); : TAG_FLAG(master_tombstone_failed_tablet_replicas, hidden); is this a test only thing? http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: PS7, Line 161: the tablet will be replicated. ?? -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
