[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894821#comment-15894821 ]

Frank Kelly commented on SOLR-8173:
---
I agree. This is a critical problem: when ZooKeeper and Solr disagree about who the leader is, there needs to be a winner rather than staying in some unrecoverable state. Even if it just randomly picked one shard, a fully operational but slightly "off" search index is better than no index at all.

> CLONE - Leader recovery process can select the wrong leader if all replicas
> for a shard are down and trying to recover as well as lose updates that
> should have been recovered.
> ---
>
> Key: SOLR-8173
> URL: https://issues.apache.org/jira/browse/SOLR-8173
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Matteo Grolla
> Assignee: Mark Miller
> Priority: Critical
> Labels: leader, recovery
> Attachments: solr_8983.log, solr_8984.log
>
> I'm doing this test:
> collection test is replicated on two solr nodes running on 8983, 8984
> using external zk
> initially both nodes are empty
> 1) turn on solr 8983
> 2) add, commit a doc x on solr 8983
> 3) turn off solr 8983
> 4) turn on solr 8984
> 5) shortly after (leader still not elected) turn on solr 8983
> 6) 8984 is elected as leader
> 7) doc x is present on 8983 but not on 8984 (check by issuing a query)
> In attachment are the solr.log files of both instances

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894789#comment-15894789 ]

Amrit Sarkar commented on SOLR-8173:
Are we planning to resolve this any time soon?
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128249#comment-15128249 ]

Stephan Lagraulet commented on SOLR-8173:
-
Thanks, sounds like a good process.
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126098#comment-15126098 ]

Stephan Lagraulet commented on SOLR-8173:
-
Can you remove Fix Version 5.2.1 if this bug is not resolved?
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126461#comment-15126461 ]

Shawn Heisey commented on SOLR-8173:
[~stephlag], I have removed 5.2.1 from the "fix version" list. Typically this field is meaningless if the issue is unresolved. Some people do populate it when creating an issue, to indicate the version they *think* it should be fixed in, but it doesn't mean anything until a fix is committed and the issue is resolved as Fixed.

When I start working on an issue, I will usually blank out "fix version" unless I happen to know with complete certainty when I will finish the work and which version will contain the fix. Knowing this with complete certainty is rare.
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991039#comment-14991039 ]

Varun Thacker commented on SOLR-8173:
-
I tried to reproduce this and I think there could be two bugs in play here:

1. The bug Matteo mentioned. These were the steps I used to reproduce it (the archive ate the query-parameter names in the CREATE URL; they are reconstructed here from the values):

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test3&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl http://127.0.0.1:8983/solr/test3/update?commit=true -H 'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574; sleep 10; ./bin/solr start -c -z localhost:2181 -s example/cloud/node1/solr -p 8983
{code}

At this point both replicas are 'ACTIVE', replica 2 becomes the leader, and the collection has 0 documents.

2. A slight variation of the test also leads to lost updates. These were the steps I used to reproduce it:

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test1&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl http://127.0.0.1:8983/solr/test1/update?commit=true -H 'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574
{code}

Replica 2 does not take leadership till the timeout. It stays in the down state:

{code}
INFO - 2015-10-26 23:15:53.026; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139681ms
{code}

Replica 2 becomes leader after the timeout:

{code}
INFO - 2015-10-26 23:18:13.127; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
INFO - 2015-10-26 23:18:13.128; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
INFO - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync replicas to http://192.168.1.9:7574/solr/test1_shard1_replica1/
INFO - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
INFO - 2015-10-26 23:18:13.130; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; http://192.168.1.9:7574/solr/test1_shard1_replica1/ has no replicas
INFO - 2015-10-26 23:18:13.131; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://192.168.1.9:7574/solr/test1_shard1_replica1/ shard1
{code}

So for the first case, I am guessing that the znode at the head of the election queue gets picked as the leader when all replicas are active. If that's the case, can we pick the replica which has the latest data? In the second case, after the timeout a replica can become a leader. Thinking aloud: should we instead mark the replica as recovery-failed by default, and have a parameter which, when specified, allows any replica to become the leader?
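The "pick the replica which has the latest data" idea could be sketched roughly as follows. This is a hypothetical illustration only; the class and method names are invented and this is not Solr's actual election code. The idea: instead of taking whichever replica sits at the head of the ZooKeeper election queue, compare candidates by the highest update version they have seen and prefer the freshest one.

```java
import java.util.Map;

// Hypothetical sketch -- names are illustrative, not Solr's real API.
public class PickFreshestReplica {

    // maps replica name -> highest version in its update log (0 = empty index)
    static String pickLeader(Map<String, Long> highestVersionByReplica) {
        return highestVersionByReplica.entrySet().stream()
                .max(Map.Entry.comparingByValue())   // freshest replica wins
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no live replicas"));
    }

    public static void main(String[] args) {
        // core_node1 indexed a doc before the restarts; core_node2 came up empty
        String leader = pickLeader(Map.of("core_node1", 1635L, "core_node2", 0L));
        System.out.println(leader); // core_node1
    }
}
```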
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978644#comment-14978644 ]

Matteo Grolla commented on SOLR-8173:
-
Hi Mark,
for me the problem also happens with a nonempty index. To reproduce:

initially both nodes
- are shut down
- contain document x

1) turn on solr 8983
2) add, commit a doc y on solr 8983
3) turn off solr 8983
4) turn on solr 8984
5) shortly after (leader still not elected) turn on solr 8983
6) 8984 is elected as leader
7) doc y is present on 8983 but not on 8984, which only contains document x (check by issuing a query)

How can I test a scenario where the default leaderVoteWait makes the right leader be elected?
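One way to experiment with the wait window (a sketch; check the solr.xml of your exact version, since the setting's presence and default can vary) is to make leaderVoteWait explicit in the <solrcloud> section of solr.xml, then bring 8983 back within that window in step 5 so both replicas participate in the sync:

```xml
<solrcloud>
  <!-- how long (ms) a would-be leader waits for the other known replicas
       to come up before forcing the election; 180000 ms is the usual
       default in this era of Solr (the log above shows this countdown) -->
  <int name="leaderVoteWait">${leaderVoteWait:180000}</int>
</solrcloud>
```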
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972854#comment-14972854 ]

Mark Miller commented on SOLR-8173:
---
From my testing with this type of use case, the problem is how we are dealing with an empty index. The leader election needs to be smarter about knowing whether an empty index is a good candidate to be leader. We need something as part of the sync-up phase that checks whether any participating replicas have any tlogs. If they do, a replica with no tlogs should not become leader.
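The check described above might look roughly like this. The names are hypothetical (this is not the actual SyncStrategy code): during sync-up, a replica whose tlog is empty declines leadership whenever any participating peer has tlog entries.

```java
import java.util.List;

// Hypothetical sketch of the rule -- not Solr's real code.
// An empty replica may only lead when every participating peer is empty too.
public class EmptyTlogLeaderCheck {

    static boolean mayBecomeLeader(long myTlogEntries, List<Long> peerTlogEntries) {
        if (myTlogEntries > 0) {
            return true; // we have updates: normal election rules apply
        }
        // our tlog is empty: lead only if all peers are empty as well
        return peerTlogEntries.stream().allMatch(n -> n == 0);
    }

    public static void main(String[] args) {
        // 8984 came up empty while 8983 has updates in its tlog:
        System.out.println(mayBecomeLeader(0, List.of(3L))); // false
        System.out.println(mayBecomeLeader(3, List.of(0L))); // true
    }
}
```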
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971207#comment-14971207 ]

Matteo Grolla commented on SOLR-8173:
-
Yes:
- unpacked the zip
- cloned the server folder
- started a 2-node cluster using the bin/solr script
- created a 'schemaless' collection using the bin/solr script
and ran the described test.
[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been reco
[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969121#comment-14969121 ]

Mark Miller commented on SOLR-8173:
---
You are doing this test with version 5.2.1?