[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532574#comment-15532574
 ] 

ASF subversion and git services commented on SOLR-9504:
---

Commit effd22457691420982534f47ee71cd52ef64b8b9 in lucene-solr's branch 
refs/heads/branch_6x from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=effd224 ]

SOLR-9504: A replica with an empty index becomes the leader even when other 
more qualified replicas are in line

(cherry picked from commit ce24de5)


> A replica with an empty index becomes the leader even when other more 
> qualified replicas are in line
> 
>
> Key: SOLR-9504
> URL: https://issues.apache.org/jira/browse/SOLR-9504
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: master (7.0)
>Reporter: Shalin Shekhar Mangar
>Priority: Critical
>  Labels: impact-high
> Fix For: 6.3, master (7.0)
>
> Attachments: SOLR-9504.patch
>
>
> I haven't tried branch_6x or any release yet, but this is trivially 
> reproducible on master with the following steps:
> # Start two Solr nodes.
> # Create a collection with 1 shard and 1 replica, so that one node stays empty.
> # Index some documents.
> # Shut down the leader node.
> # Use the ADDREPLICA API to create a replica of the collection on the 
> still-running node. For some reason this API hangs until you restart the 
> other node (possibly a bug in itself), but do not wait for the API to complete.
> # Restart the former leader node.
> You'll find that the replica with 0 docs has become the leader. The former 
> leader recovers from this new leader without replicating any index files, so it 
> still has the old index, which contains some docs.
> This is from the logs of the 0 doc replica:
> {code}
> 713102 INFO  (zkCallback-4-thread-5-processing-n:127.0.1.1:7574_solr) [   ] 
> o.a.s.c.c.ZkStateReader Updating data for [gettingstarted] from [9] to [10]
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext Enough 
> replicas found to continue.
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext I may be 
> the new leader - try and sync
> 714377 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.SyncStrategy Sync replicas to 
> http://127.0.1.1:7574/solr/gettingstarted_shard1_replica2/
> 714380 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: 
> core=gettingstarted_shard1_replica2 url=http://127.0.1.1:7574/solr START 
> replicas=[http://127.0.1.1:8983/solr/gettingstarted_shard1_replica1/] 
> nUpdates=100
> 714381 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: 
> core=gettingstarted_shard1_replica2 url=http://127.0.1.1:7574/solr DONE.  We 
> have no versions.  sync failed.
> 714382 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.SyncStrategy Leader's attempt to 
> sync with shard failed, moving to the next candidate
> 714382 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext We 
> failed sync, but we have no versions - we can't sync in that case - we were 
> active before, so become leader anyway
> 714387 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContextBase 
> Creating leader registration node 
> /collections/gettingstarted/leaders/shard1/leader after winning as 
> /collections/gettingstarted/leader_elect/shard1/election/96579592334475268-core_node2-n_01
> 714398 INFO  (qtp110456297-15) [c:gettingstarted s:shard1 r:core_node2 
> x:gettingstarted_shard1_replica2] o.a.s.c.ShardLeaderElectionContext I am the 
> new leader: http://127.0.1.1:7574/solr/gettingstarted_shard1_replica2/ shard1
> {code}
> It basically tries to sync but has no versions, and because it is considered to 
> have been active before (it is a new core starting up for the first time), it 
> becomes the leader and publishes itself as active.
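
The last log line boils down to something like the following toy sketch (class and 
method names are made up for illustration; this is not Solr's actual election code):

{code}
import java.util.Collections;
import java.util.List;

// Toy illustration of the behaviour described above: a candidate that failed
// PeerSync still takes leadership when it has no versions and was "active" before.
public class EmptyIndexLeaderSketch {

    static boolean becomeLeaderAnyway(List<Long> localVersions, boolean wasActiveBefore) {
        // "We failed sync, but we have no versions - we can't sync in that case -
        //  we were active before, so become leader anyway"
        return localVersions.isEmpty() && wasActiveBefore;
    }

    public static void main(String[] args) {
        List<Long> freshReplica = Collections.emptyList(); // newly added core, no updates yet
        System.out.println(becomeLeaderAnyway(freshReplica, true)); // true -- the empty replica wins
    }
}
{code}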






[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532552#comment-15532552
 ] 

ASF subversion and git services commented on SOLR-9504:
---

Commit ce24de5cd65726dd9593512ec4082ba81b9d7801 in lucene-solr's branch 
refs/heads/master from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ce24de5 ]

SOLR-9504: A replica with an empty index becomes the leader even when other 
more qualified replicas are in line





[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491676#comment-15491676
 ] 

Shalin Shekhar Mangar commented on SOLR-9504:
-

Longer term, we need to work on a bi-directional sync during recovery to really 
solve these kinds of issues.
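
As a purely conceptual illustration of what "bi-directional" could mean here (all 
names are made up; this is not the Solr recovery code), each side would exchange 
its recent update versions and request whatever it is missing, rather than only 
the recovering replica pulling from the leader:

{code}
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Conceptual sketch only: compute what each side would need from the other.
public class BidirectionalSyncSketch {

    // Versions present on 'from' but absent on 'to' would have to be replayed to 'to'.
    static Set<Long> missingOn(Set<Long> to, Set<Long> from) {
        Set<Long> missing = new TreeSet<>(from);
        missing.removeAll(to);
        return missing;
    }

    public static void main(String[] args) {
        Set<Long> leader  = new TreeSet<>(Arrays.asList(101L, 102L, 105L));
        Set<Long> replica = new TreeSet<>(Arrays.asList(101L, 103L));

        // In a bi-directional sync both sides would request what they lack,
        // instead of the recovering side blindly trusting the leader.
        System.out.println("leader needs:  " + missingOn(leader, replica));   // [103]
        System.out.println("replica needs: " + missingOn(replica, leader));   // [102, 105]
    }
}
{code}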




[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491672#comment-15491672
 ] 

Shalin Shekhar Mangar commented on SOLR-9504:
-

Whoops! You wrote the same thing that I did. I'll work on adding such a check 
to peer sync.
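
Very roughly, the kind of check meant here could look like the sketch below (all 
names invented for illustration; this is not the actual patch): a candidate whose 
own index is empty should not treat peer sync as acceptable while a peer still 
reports versions.

{code}
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: refuse to call sync "good enough" when we are empty
// but at least one peer has updates we have never seen.
public class PeerSyncEmptyCheckSketch {

    static boolean syncAcceptable(List<Long> localVersions, Collection<List<Long>> peerVersions) {
        if (!localVersions.isEmpty()) {
            return true;                      // we have data; the normal sync result applies
        }
        for (List<Long> peer : peerVersions) {
            if (!peer.isEmpty()) {
                return false;                 // a peer has updates we lack, so we must not win
            }
        }
        return true;                          // everyone is empty; nothing can be lost
    }

    public static void main(String[] args) {
        Collection<List<Long>> peers = Collections.singletonList(Arrays.asList(101L, 102L));
        System.out.println(syncAcceptable(Collections.<Long>emptyList(), peers)); // false
    }
}
{code}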




[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491668#comment-15491668
 ] 

Shalin Shekhar Mangar commented on SOLR-9504:
-

[~markrmil...@gmail.com] - The behavior when the leader vote wait expires is well 
known, but for it to happen before that period expires is a surprise (at least to 
me). Perhaps instead of just giving up when the leader candidate has no versions, 
it could request versions from its peers anyway and rejoin the election if any of 
them have some?
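
A sketch of that idea (names and structure invented for illustration; this is not 
the committed fix): an empty candidate asks its peers for their versions and 
rejoins the election, instead of taking leadership, if any peer has some.

{code}
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of "request versions from peers and rejoin if others have some".
public class RejoinIfPeersHaveVersionsSketch {

    enum Decision { BECOME_LEADER, REJOIN_ELECTION }

    static Decision decide(List<Long> localVersions, Collection<List<Long>> peerVersions) {
        if (!localVersions.isEmpty()) {
            return Decision.BECOME_LEADER;        // normal case: we have data and sync already ran
        }
        for (List<Long> peer : peerVersions) {
            if (!peer.isEmpty()) {
                return Decision.REJOIN_ELECTION;  // someone else has updates; let them lead
            }
        }
        return Decision.BECOME_LEADER;            // the whole shard is empty; leading is harmless
    }

    public static void main(String[] args) {
        Collection<List<Long>> peers = Collections.singletonList(Arrays.asList(101L, 102L, 103L));
        System.out.println(decide(Collections.<Long>emptyList(), peers)); // REJOIN_ELECTION
    }
}
{code}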




[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491513#comment-15491513
 ] 

Mark Miller commented on SOLR-9504:
---

I think this one is known. There is another JIRA issue around this somewhere as 
well, and it was understood as an ugly limitation when another bug in this area 
was fixed. I had meant to add something to peer sync that would let a candidate 
check whether another replica looked better because it wasn't empty, but I never 
got to it.




[jira] [Commented] (SOLR-9504) A replica with an empty index becomes the leader even when other more qualified replicas are in line

2016-09-12 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485100#comment-15485100
 ] 

Shalin Shekhar Mangar commented on SOLR-9504:
-

FYI [~markrmil...@gmail.com], [~ysee...@gmail.com]
