[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

Mark Miller (JIRA) Thu, 28 Jan 2016 19:06:58 -0800

    [ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122860#comment-15122860 ]


Mark Miller commented on SOLR-8619:
-----------------------------------

bq. Sure, I strongly think we need to be intelligent in electing leaders. That 
would solve this problem but why would we want a new replica to get added up 
that can't do anything but consume resources for a core? Not a ton of resources 
but still. I guess you'll agree.

Because it seems if you want to add a replica, you want to add a replica.

Let's say I add a replica right when the first replica goes down for some 
reason, e.g. it loses its ZK connection due to a GC event, but then it 
reconnects. It almost seems preferable to me that my add replica call still 
works but the new replica won't become the leader. Then, when the first replica 
quickly re-establishes its connection to ZK, the new replica will recover from 
it.

My thinking is, if I want to add a replica, I don't care that it has no one to 
recover from at any given moment. I want to add a replica to the shard now. Let 
the system work out when it's safe and possible to sync up with the shard. 
Otherwise, I have to handle the failure, go look at why it happened, try to get 
that straightened out, try the call again, repeat, etc.

There doesn't seem to be a strong reason to fail: the call can easily work, and 
when the other replicas come back online, everything will settle out. We just 
want to make sure the new replica won't become the leader without recovering 
first.
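
A rough sketch of the guard described above, in Python for brevity. This is 
not Solr's election code; `Replica`, `ever_active`, and `may_become_leader` 
are made-up names, and the rule is only an assumption about what "won't 
become the leader without recovering first" could look like:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str          # hypothetical core node name
    live: bool         # node currently has a working ZK session
    ever_active: bool  # replica has been ACTIVE before, i.e. it held the shard's data


def may_become_leader(candidate: Replica, shard: List[Replica]) -> bool:
    """Hypothetical election guard: a never-recovered replica may lead
    only if it is the sole replica registered for the shard."""
    if not candidate.live:
        return False
    if candidate.ever_active:
        return True
    # A brand-new replica is eligible only when there is no one it could
    # have recovered from in the first place.
    others = [r for r in shard if r.name != candidate.name]
    return not others
```

With this rule the add replica call still succeeds; the new replica just 
sits there without becoming leader until a replica that has the data comes 
back and it can recover from it.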


> A new replica should not become leader when all current replicas are down as 
> it leads to data loss
> --------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8619
>                 URL: https://issues.apache.org/jira/browse/SOLR-8619
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Anshum Gupta
>
> Here's what I'm talking about:
> * Start a 2 node solrcloud cluster
> * Create a 1 shard/1 replica collection
> * Add documents
> * Shut down the node that has the only active replica of the shard
> * ADDREPLICA for the shard/collection, so Solr would attempt to add a new 
> replica on the other node
> * Solr waits for a while before this replica becomes an active leader.
> * Index a few new docs
> * Bring up the old node
> * The replica comes up with its old index and then syncs to only contain 
> the docs from the new leader.
> All old documents are lost in this case.
> Here are a few things that might work here:
> 1. Reject an ADDREPLICA call if all current replicas for the shard are down. 
> Considering the new replica can not sync from anyone, it doesn't make sense 
> for this replica to even come up
> 2. The replica shouldn't become active/leader unless it was the last known 
> leader or was active before it went into the recovering state, or there are 
> no other replicas in the clusterstate.
> This might very well be related to SOLR-8173 but we should add a check to 
> ADDREPLICA as well.
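
The reproduction steps quoted above can be walked through with a toy model. 
This is only a sketch under simplifying assumptions: `Core`, `sync_from`, and 
`reproduce` are hypothetical names, sync is modeled as "end up with exactly 
the leader's documents", and in the guarded case the new docs are assumed to 
be indexed only after the old replica is back as leader:

```python
class Core:
    """Toy stand-in for a replica's index: just a set of document ids."""

    def __init__(self, name):
        self.name = name
        self.docs = set()

    def sync_from(self, leader):
        # Simplified sync: the replica ends up with exactly the leader's docs.
        self.docs = set(leader.docs)


def reproduce(guard_new_replica):
    old = Core("old")
    old.docs.update({"doc1", "doc2"})   # documents added before the shutdown

    # The old node is shut down and ADDREPLICA creates an empty replica.
    new = Core("new")

    if guard_new_replica:
        # The new replica never leads; the old replica returns as leader,
        # the new one recovers from it, and only then is doc3 indexed.
        new.sync_from(old)
        old.docs.add("doc3")
        new.docs.add("doc3")
    else:
        # The empty replica becomes leader and accepts doc3; the old
        # replica then syncs to it and drops its own documents.
        new.docs.add("doc3")
        old.sync_from(new)

    return sorted(old.docs), sorted(new.docs)
```

Calling `reproduce(False)` leaves both cores with only `doc3`, which is the 
data loss described in the issue; `reproduce(True)` leaves both with all 
three documents.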



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
