[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491676#comment-15491676
 ] 

Shalin Shekhar Mangar commented on SOLR-9504:
---------------------------------------------

Longer term, we need to work on a bi-directional sync during recovery to really 
solve this kind of issue.

> A replica with an empty index becomes the leader even when other more 
> qualified replicas are in line
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9504
>                 URL: https://issues.apache.org/jira/browse/SOLR-9504
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (7.0)
>            Reporter: Shalin Shekhar Mangar
>            Priority: Critical
>              Labels: impact-high
>             Fix For: 6.3, master (7.0)
>
>
> I haven't tried branch_6x or any release yet. But this is trivially 
> reproducible on master with the following steps:
> # Start two solr nodes
> # Create a collection with 1 shard, 1 replica so that one node is empty.
> # Index some documents
> # Shut down the leader node
> # Use the addreplica API to create a replica of the collection on the 
> still-running node. For some reason this API call hangs until you restart the 
> other node (possibly a bug in itself), so do not wait for it to complete.
> # Restart the former leader node
> You'll find that the replica with 0 docs has become the leader. The former 
> leader recovers from the new leader without replicating any index files. It 
> still has its old index, which contains some docs.
> This is from the logs of the 0 doc replica:
> {code}
> 713102 INFO  (zkCallback-4-thread-5-processing-n:127.0.1.1:7574_solr) [   ] 
> o.a.s.c.c.ZkStateReader Updating data for [gettingstarted] from [9] to [10]
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext Enough 
> replicas found to continue.
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext I may be 
> the new leader - try and sync
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.SyncStrategy Sync replicas to 
> http://127.0.1.1:7574/solr/gettingstarted_shard1_replica2/
> 714380 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: 
> core=gettingstarted_shard1_replica2 url=http://127.0.1.1:7574/solr START 
> replicas=[http://127.0.1.1:8983/solr/gettingstarted_shard1_replica1/] 
> nUpdates=100
> 714381 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: 
> core=gettingstarted_shard1_replica2 url=http://127.0.1.1:7574/solr DONE.  We 
> have no versions.  sync failed.
> 714382 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.SyncStrategy Leader's attempt to 
> sync with shard failed, moving to the next candidate
> 714382 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext We 
> failed sync, but we have no versions - we can't sync in that case - we were 
> active before, so become leader anyway
> 714387 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContextBase 
> Creating leader registration node 
> /collections/gettingstarted/leaders/shard1/leader after winning as 
> /collections/gettingstarted/leader_elect/shard1/election/96579592334475268-core_node2-n_0000000001
> 714398 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext I am the 
> new leader: http://127.0.1.1:7574/solr/gettingstarted_shard1_replica2/ shard1
> {code}
> It basically tries to sync but has no versions, and because it was active 
> before (it is a new core starting up for the first time), it becomes the 
> leader and publishes itself as active.


