[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Dinesh Bhat (Code Review) Wed, 16 Nov 2016 17:28:35 -0800

Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction 
failures
......................................................................



Patch Set 1:

(1 comment)

TFTR David,
 
> (1 comment)
 > 
 > would it be possible to split this patch into several ones that
 > take care of each ticker individually or would that be too hard?

I can do that if it helps for easier review, but clubbing them in one patch was 
intentional since they were all related to one set of problem space and it was 
easier to test them together. Also, the actual fixes are in few lines under 
consensus_peers.cc and catalog_manager.cc, majority of the diffs are in tests I 
think. 
PS: I need to move the kudu-tool-test diffs to a correct location, because I 
used kudu-tool-test as a placeholder and it is a terrible idea to place this 
test there.

http://gerrit.cloudera.org:8080/#/c/5111/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS1, Line 235:   if (!controller_.status().ok()) {
             :     if (controller_.status().IsRemoteError()) {
             :       // Most controller errors are caused by network issues or 
corner cases
             :       // like shutdown and failure to serialize a protobuf. 
Therefore, we
             :       // generally consider these errors to indicate an 
unreachable peer.
             :       // However, a RemoteError wraps some other error 
propagated from the
             :       // remote peer, so we know the remote is alive. Therefore, 
we will let
             :       // the queue know that the remote is responsive.
             :       
queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :     }
             :     ProcessResponseError(controller_.status());
             :     return;
             :   }
             :   // Again, let the queue know that the remote is still 
responsive, since we
             :   // will not be sending this error response through to the 
queue.
             :   // For certain error codes, we want the queue to treat the 
remote as
             :   // unresponsive and take necessary actions, hence bypassing 
the notification
             :   // for those error codes.
             :   if (response_.has_error()) {
             :     if (response_.error().code() != 
TabletServerErrorPB::TABLET_NOT_FOUND &&
             :         response_.error().code() != 
TabletServerErrorPB::WRONG_SERVER_UUID &&
             :         response_.error().code() != 
TabletServerErrorPB::TABLET_NOT_RUNNING) {
             :       
queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       
ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   } else if (response_.status().has_error()) {
             :     if (response_.status().error().code() == 
consensus::ConsensusErrorPB::CANNOT_PREPARE) {
             :       
queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       
ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   }
> It would be good if we could have something like we have in the c++ client 
Yeah, that's a good idea, I will look into that. Also slightly related to this 
topic: I wanted to group these errors into 2 buckets,  fatal and retriable. 
That way we can consolidate all of the error codes in one place and less window 
for programmers to leave some error codes behind. I will attempt that in a 
different patch though as it is not clear to me whether such grouping could 
work in all situations. For eg, TABLET_NOT_FOUND may be a fatal error for 
consensus, but may not be a fatal for some other workflow like DeleteTablet.


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Reply via email to