[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2017-03-03 Thread Frank Kelly (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894821#comment-15894821 ]

Frank Kelly commented on SOLR-8173:
---

I agree. 
This is a critical problem: when ZooKeeper and Solr disagree about who the 
leader is, there needs to be a winner rather than the shard staying in some 
unrecoverable state. Even if Solr just randomly picked one replica, a fully 
operational but slightly "off" search index is better than no index at all.
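To make "there must always be a winner" concrete, here is a toy sketch in plain Python. All names and fields (`pick_leader`, `max_version`, etc.) are invented for illustration; this is not Solr's actual election logic in ShardLeaderElectionContext.

```python
# Illustrative only: a deterministic leader pick that always yields a winner.
# Prefer the replica with the newest known data; break ties by name so every
# observer computes the same result instead of deadlocking.

def pick_leader(replicas):
    """Pick a leader among live replicas, or None if none are live."""
    live = [r for r in replicas if r["state"] == "active"]
    if not live:
        return None
    # Highest max_version first; name ascending as a deterministic tie-break.
    return sorted(live, key=lambda r: (-r["max_version"], r["name"]))[0]
```

Even when all candidates look equally stale, the tie-break guarantees the shard comes back up rather than staying leaderless.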



> CLONE - Leader recovery process can select the wrong leader if all replicas 
> for a shard are down and trying to recover as well as lose updates that 
> should have been recovered.
> ---
>
> Key: SOLR-8173
> URL: https://issues.apache.org/jira/browse/SOLR-8173
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Matteo Grolla
>Assignee: Mark Miller
>Priority: Critical
>  Labels: leader, recovery
> Attachments: solr_8983.log, solr_8984.log
>
>
> I'm doing this test:
> collection "test" is replicated on two solr nodes running on 8983 and 8984, 
> using external zk
> initially both nodes are empty
> 1) turn on solr 8983
> 2) add and commit a doc x on solr 8983
> 3) turn off solr 8983
> 4) turn on solr 8984
> 5) shortly after (leader still not elected), turn on solr 8983
> 6) 8984 is elected as leader
> 7) doc x is present on 8983 but not on 8984 (checked by issuing a query)
> The solr.log files of both instances are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2017-03-03 Thread Amrit Sarkar (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894789#comment-15894789 ]

Amrit Sarkar commented on SOLR-8173:


Are we planning to resolve this any time soon?




[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2016-02-02 Thread Stephan Lagraulet (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128249#comment-15128249 ]

Stephan Lagraulet commented on SOLR-8173:
-

Thanks, sounds like a good process.




[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2016-02-01 Thread Stephan Lagraulet (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126098#comment-15126098 ]

Stephan Lagraulet commented on SOLR-8173:
-

Can you remove Fix version 5.2.1 if this bug is not resolved?




[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2016-02-01 Thread Shawn Heisey (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126461#comment-15126461 ]

Shawn Heisey commented on SOLR-8173:


[~stephlag], I have removed 5.2.1 from the "fix version" list.

Typically this field is meaningless if the issue is unresolved.  Some people do 
populate it when creating an issue, to indicate the version they *think* it 
should be fixed in, but it doesn't mean anything until a fix is committed and 
the issue is resolved as Fixed.  When I start working on an issue, I will 
usually blank out "fix version", unless I happen to know with complete 
certainty when I will finish the work and which version will contain the fix.  
Knowing this with complete certainty is rare.




[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2015-11-04 Thread Varun Thacker (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991039#comment-14991039 ]

Varun Thacker commented on SOLR-8173:
-

I tried to reproduce this and I think there are two bugs in play here:

1. The bug Matteo mentioned. These were the steps I used to reproduce it:

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test3&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl http://127.0.0.1:8983/solr/test3/update?commit=true -H 
'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574; 
sleep 10; ./bin/solr start -c -z localhost:2181 -s example/cloud/node1/solr -p 
8983

At this point both replicas are 'ACTIVE', replica 2 becomes the leader, and 
the collection has 0 documents.
{code}

2. A slight variation of the test also leads to lost updates. These were the 
steps I used to reproduce it.

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test1&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl http://127.0.0.1:8983/solr/test1/update?commit=true -H 
'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574
{code}

{code}
Replica 2 does not take leadership until the timeout; it stays in the down state.

INFO  - 2015-10-26 23:15:53.026; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; 
Waiting until we see more replicas up for shard shard1: total=2 found=1 
timeoutin=139681ms

Replica 2 becomes leader after timeout

INFO  - 2015-10-26 23:18:13.127; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; Was 
waiting for replicas to come up, but they are taking too long - assuming they 
won't come back till later
INFO  - 2015-10-26 23:18:13.128; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I 
may be the new leader - try and sync
INFO  - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync replicas to 
http://192.168.1.9:7574/solr/test1_shard1_replica1/
INFO  - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync Success - now 
sync replicas to me
INFO  - 2015-10-26 23:18:13.130; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; 
http://192.168.1.9:7574/solr/test1_shard1_replica1/ has no replicas
INFO  - 2015-10-26 23:18:13.131; [c:test1 s:shard1 r:core_node2 
x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I am 
the new leader: http://192.168.1.9:7574/solr/test1_shard1_replica1/ shard1
{code}

So for the first case, I am guessing that the znode at the head of the 
election queue gets picked as the leader when all replicas are active. If 
that's the case, can we pick the replica which has the latest data instead?
For the second case, after the timeout a replica can become leader. Thinking 
aloud: should we instead mark the replica as "recovery failed" by default, 
and add a parameter which, when specified, allows any replica to become the 
leader?
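To make the first suggestion concrete, here is a toy sketch in plain Python. The names (`election_queue`, `versions`) are invented for illustration, and Solr's real election code does not work this way; the point is only the selection rule: prefer the replica with the newest data, falling back to queue order on ties.

```python
# Illustrative only: instead of blindly crowning the znode at the head of
# the election queue, prefer the replica whose transaction log holds the
# highest update version; keep queue order as the tie-break.

def newest_replica(election_queue, versions):
    """election_queue: replica names in election order (head first).
    versions: replica name -> highest update version seen in its tlog."""
    return min(
        election_queue,
        key=lambda r: (-versions.get(r, -1), election_queue.index(r)),
    )
```

With this rule the scenario above would elect the 8983 replica (which has doc x) rather than the empty 8984 replica that merely registered first.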

[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2015-10-28 Thread Matteo Grolla (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978644#comment-14978644 ]

Matteo Grolla commented on SOLR-8173:
-

Hi Mark,
for me the problem also happens with a non-empty index. To reproduce:

initially both nodes 
- are shut down
- CONTAIN DOCUMENT X

1) turn on solr 8983
2) add and commit a doc y on solr 8983
3) turn off solr 8983
4) turn on solr 8984
5) shortly after (leader still not elected), turn on solr 8983
6) 8984 is elected as leader
7) doc y is present on 8983 but not on 8984 (checked by issuing a query); 
8984 only contains document x

How can I test a scenario where the default leaderVoteWait causes the right 
leader to be elected?






[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2015-10-24 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972854#comment-14972854 ]

Mark Miller commented on SOLR-8173:
---

From my testing with this type of use case, the problem is how we are dealing 
with an empty index.

The leader election needs to be smarter about knowing if an empty index is a 
good candidate to be leader.

We need something that is part of the sync up phase that checks if any 
participating replicas have any tlogs. If they do, a replica with no tlogs 
should not become leader.
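As a toy illustration of that sync-up check (plain Python with invented names, not Solr code): if any participating replica reports tlog entries, replicas with an empty tlog are filtered out of the candidate set; if nobody has tlogs, everyone remains eligible.

```python
# Illustrative only: filter leader candidates during the sync-up phase.
# A replica with no transaction log must not lead if any peer has one,
# since its empty index would overwrite the peers' recovered updates.

def eligible_leaders(candidates, tlog_entries):
    """candidates: participating replica names.
    tlog_entries: replica name -> number of tlog entries it reports."""
    if any(tlog_entries.get(r, 0) > 0 for r in candidates):
        return [r for r in candidates if tlog_entries.get(r, 0) > 0]
    # No participant has any tlog (e.g. a brand-new collection): all eligible.
    return list(candidates)
```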




[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2015-10-23 Thread Matteo Grolla (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971207#comment-14971207 ]

Matteo Grolla commented on SOLR-8173:
-

Yes:
- unpacked the zip
- cloned the server folder
- started a 2-node cluster using the bin/solr script
- created a 'schemaless' collection using the bin/solr script and ran the 
described test.






[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2015-10-22 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969121#comment-14969121 ]

Mark Miller commented on SOLR-8173:
---

You are doing this test with version 5.2.1?
