Dinesh Bhat has posted comments on this change. Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures ......................................................................
Patch Set 1: (1 comment) TFTR David, > (1 comment) > > would it be possible to split this patch into several ones that > take care of each ticker individually or would that be too hard? I can do that if it helps for easier review, but clubbing them in one patch was intentional since they were all related to one set of problem space and it was easier to test them together. Also, the actual fixes are in few lines under consensus_peers.cc and catalog_manager.cc, majority of the diffs are in tests I think. PS: I need to move the kudu-tool-test diffs to a correct location, because I used kudu-tool-test as a placeholder and it is a terrible idea to place this test there. http://gerrit.cloudera.org:8080/#/c/5111/1/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS1, Line 235: if (!controller_.status().ok()) { : if (controller_.status().IsRemoteError()) { : // Most controller errors are caused by network issues or corner cases : // like shutdown and failure to serialize a protobuf. Therefore, we : // generally consider these errors to indicate an unreachable peer. : // However, a RemoteError wraps some other error propagated from the : // remote peer, so we know the remote is alive. Therefore, we will let : // the queue know that the remote is responsive. : queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid()); : } : ProcessResponseError(controller_.status()); : return; : } : // Again, let the queue know that the remote is still responsive, since we : // will not be sending this error response through to the queue. : // For certain error codes, we want the queue to treat the remote as : // unresponsive and take necessary actions, hence bypassing the notification : // for those error codes. : if (response_.has_error()) { : if (response_.error().code() != TabletServerErrorPB::TABLET_NOT_FOUND && : response_.error().code() != TabletServerErrorPB::WRONG_SERVER_UUID && : response_.error().code() != TabletServerErrorPB::TABLET_NOT_RUNNING) { : queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid()); : ProcessResponseError(StatusFromPB(response_.error().status())); : return; : } : } else if (response_.status().has_error()) { : if (response_.status().error().code() == consensus::ConsensusErrorPB::CANNOT_PREPARE) { : queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid()); : ProcessResponseError(StatusFromPB(response_.error().status())); : return; : } : } > It would be good if we could have something like we have in the c++ client Yeah, that's a good idea, I will look into that. Also slightly related to this topic: I wanted to group these errors into 2 buckets, fatal and retriable. That way we can consolidate all of the error codes in one place and less window for programmers to leave some error codes behind. I will attempt that in a different patch though as it is not clear to me whether such grouping could work in all situations. For eg, TABLET_NOT_FOUND may be a fatal error for consensus, but may not be a fatal for some other workflow like DeleteTablet. -- To view, visit http://gerrit.cloudera.org:8080/5111 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
