Todd Lipcon commented on KUDU-1618:

bq. Todd Lipcon, thanks for the quick reply. By 'shouldn't have a replica' in 
the above comment, you meant that the current tablet server, where we are trying 
to bring up the replica, is no longer part of the Raft config for that tablet, 
right? It has other tservers as replicas at this point. That makes sense. I 
believe the tserver keeps trying until some future change_config brings this 
tserver in as a replica for that tablet.

Right. When you copied the replica, it also copied the new configuration, and 
this tserver is not a part of that configuration. So it knows that it shouldn't 
try to get the other nodes to vote for it. It would be reasonable to say that we 
should detect this scenario and mark the tablet as 'failed', but it's actually 
somewhat useful occasionally -- e.g. I've used this before to copy a tablet from 
a running cluster onto my laptop so I could then use tools like 'dump_tablet' against it 
locally. Given that I don't think you can get into this state without using 
explicit tablet copy repair tools, I don't think it should really be considered 
a bug.
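
To illustrate the point about the copied configuration, here is a minimal 
Python sketch of a replica checking whether it is a voter in its committed 
config before asking for votes. This is not Kudu's actual code: the names 
RaftConfig and should_start_election, and the UUIDs, are hypothetical and 
exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftConfig:
    """Hypothetical stand-in for a tablet's committed Raft configuration."""
    voter_uuids: frozenset


def should_start_election(local_uuid: str, config: RaftConfig) -> bool:
    # A replica that is not a voter in its own committed config should not
    # ask the other nodes to vote for it.
    return local_uuid in config.voter_uuids


# Config copied along with the replica: three tservers, none of them the
# machine the replica was copied onto (assumed UUIDs).
copied = RaftConfig(voter_uuids=frozenset({"ts-a", "ts-b", "ts-c"}))
```

With this config, should_start_election("laptop", copied) is False, so the 
copied replica stays quiet while its data remains readable locally.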

bq. One follow-up question: what state should the replica be in after step 6? I 
see it in RUNNING state, which was slightly confusing, because this replica 
isn't an active replica at this point.

The tablet's state refers more to the data layer: it's up and running, it has 
replayed its log, it has valid data, etc. So it's RUNNING even though it's not 
actually an active part of any Raft configuration. If you run ksck on this 
cluster, does it report the "extra" replica anywhere? Having ksck report it 
might be useful so we can detect if this ever happens in real life.
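
As a rough sketch of the kind of check meant here, the Python below flags 
replicas hosted by a tserver that is not in the tablet's Raft config. It does 
not describe how ksck works today. The function name find_extra_replicas and 
both input shapes are assumptions made for the example.

```python
def find_extra_replicas(reported, configs):
    """Return sorted (tserver, tablet) pairs hosted outside the tablet's config.

    reported: dict mapping tserver uuid -> set of tablet ids it hosts
    configs:  dict mapping tablet id -> set of tserver uuids in its Raft config
    Both shapes are hypothetical, chosen only for this sketch.
    """
    extras = []
    for tserver, tablets in reported.items():
        for tablet in tablets:
            # A tablet missing from configs has no known members, so any
            # replica of it counts as extra.
            if tserver not in configs.get(tablet, set()):
                extras.append((tserver, tablet))
    return sorted(extras)


# Assumed example data: ts-d hosts a copy of t1 but is not in t1's config.
reported = {"ts-a": {"t1"}, "ts-b": {"t1"}, "ts-d": {"t1"}}
configs = {"t1": {"ts-a", "ts-b", "ts-c"}}
extras = find_extra_replicas(reported, configs)
```

Here extras contains only the pair ("ts-d", "t1"), i.e. the replica that is 
RUNNING at the data layer but is not part of the tablet's configuration.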

> Add local_replica tool to delete a replica
> ------------------------------------------
>                 Key: KUDU-1618
>                 URL: https://issues.apache.org/jira/browse/KUDU-1618
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>            Assignee: Dinesh Bhat
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.

This message was sent by Atlassian JIRA
