[jira] [Commented] (SOLR-8173) CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

Varun Thacker (JIRA) Wed, 04 Nov 2015 19:17:43 -0800

    [ https://issues.apache.org/jira/browse/SOLR-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991039#comment-14991039 ]


Varun Thacker commented on SOLR-8173:
-------------------------------------

I tried to reproduce this and I think there could be two bugs in play here:

1. The bug Matteo mentioned. These were the steps I used to reproduce it:

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test3&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl 'http://127.0.0.1:8983/solr/test3/update?commit=true' -H 'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574; sleep 10; ./bin/solr start -c -z localhost:2181 -s example/cloud/node1/solr -p 8983

At this point both replicas are 'ACTIVE', replica 2 becomes the leader, and the
collection has 0 documents.
{code}

2. A slight variation of the test also leads to lost updates. These were the 
steps I used to reproduce it.

{code}
./bin/solr start -e cloud -noprompt -z localhost:2181

http://localhost:8983/solr/admin/collections?action=CREATE&name=test1&collection.configName=gettingstarted&numShards=1&replicationFactor=2

core_node1 = core_node2 = active

./bin/solr stop -p 7574

core_node2 = down

curl 'http://127.0.0.1:8983/solr/test1/update?commit=true' -H 'Content-type:application/json' -d '[{"id" : "1"}]'

./bin/solr stop -p 8983

./bin/solr start -c -z localhost:2181 -s example/cloud/node2/solr -p 7574
{code}

{code}
Replica 2 does not take leadership till timeout. It stays in down state.

INFO  - 2015-10-26 23:15:53.026; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139681ms

Replica 2 becomes leader after timeout

INFO  - 2015-10-26 23:18:13.127; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
INFO  - 2015-10-26 23:18:13.128; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
INFO  - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync replicas to http://192.168.1.9:7574/solr/test1_shard1_replica1/
INFO  - 2015-10-26 23:18:13.129; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
INFO  - 2015-10-26 23:18:13.130; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.SyncStrategy; http://192.168.1.9:7574/solr/test1_shard1_replica1/ has no replicas
INFO  - 2015-10-26 23:18:13.131; [c:test1 s:shard1 r:core_node2 x:test1_shard1_replica1] org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://192.168.1.9:7574/solr/test1_shard1_replica1/ shard1
{code}
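
The behaviour in these logs looks like a bounded wait: the replica polls for its peers and, once the timeout expires, goes ahead with the election on its own. A rough Python sketch of that pattern follows. It is not Solr's code; the function name, its parameters, and the simulated clock are all made up for illustration.

```python
from typing import Callable


def wait_for_all_replicas(total: int,
                          count_live: Callable[[], int],
                          timeout_ms: int,
                          poll_ms: int = 500) -> bool:
    """Return True if all replicas were seen before the timeout, else False.

    Hypothetical helper: time is simulated by a counter so the sketch
    runs instantly; a real implementation would sleep between polls.
    """
    waited_ms = 0
    while waited_ms < timeout_ms:
        if count_live() >= total:
            return True
        waited_ms += poll_ms  # stand-in for sleeping poll_ms milliseconds
    return False


# Case 2 above: only one of two replicas ever shows up, so the wait gives up
# and the caller would proceed with the election anyway.
everyone_arrived = wait_for_all_replicas(total=2, count_live=lambda: 1,
                                         timeout_ms=2000)
print("all replicas seen:", everyone_arrived)
```

A False result is the point at which replica 2 went on to become leader in the logs, even though it had never seen the update.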

So for the first case, I am guessing that the znode at the head gets picked up 
as the leader when all replicas are active. If that's the case, can we pick the 
replica which has the latest data instead?
In the second case, a replica can become the leader after the timeout. Thinking 
aloud: should we mark the replica as recovery failed by default, and have a 
parameter which, when specified, allows any replica to become the leader?
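
To make the two suggestions concrete, here is a minimal Python sketch of the selection rule I have in mind. It is only an illustration, not Solr code: the Replica class, the latest_version field (standing in for "has the latest data"), and the allow_any_replica flag are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    live: bool
    latest_version: int  # hypothetical: highest update version this replica holds


def pick_leader(replicas: List[Replica],
                timed_out: bool = False,
                allow_any_replica: bool = False) -> Optional[Replica]:
    """Return the replica that should lead, or None if no leader is chosen."""
    live = [r for r in replicas if r.live]
    if not live:
        return None
    everyone_up = len(live) == len(replicas)
    # Case 1: all replicas are up, so prefer the one with the newest data
    # instead of whichever happens to be first in line.
    # Case 2: after a timeout, only elect from a partial set when the
    # (hypothetical) opt-in flag is set; otherwise elect nobody.
    if everyone_up or (timed_out and allow_any_replica):
        return max(live, key=lambda r: r.latest_version)
    return None


node1 = Replica("core_node1", live=True, latest_version=1)  # has doc id=1
node2 = Replica("core_node2", live=True, latest_version=0)  # missed the update
print(pick_leader([node2, node1]).name)
```

With both replicas up, core_node1 is chosen because it holds the newer update. If core_node1 is down and the wait times out, the function returns None unless allow_any_replica is passed, which corresponds to leaving the shard without a leader rather than silently losing the document.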

> CLONE - Leader recovery process can select the wrong leader if all replicas 
> for a shard are down and trying to recover as well as lose updates that 
> should have been recovered.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8173
>                 URL: https://issues.apache.org/jira/browse/SOLR-8173
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Matteo Grolla
>            Assignee: Mark Miller
>            Priority: Critical
>              Labels: leader, recovery
>             Fix For: 5.2.1
>
>         Attachments: solr_8983.log, solr_8984.log
>
>
> I'm doing this test
> collection test is replicated on two solr nodes running on 8983, 8984
> using external zk
> initially both nodes are empty
> 1)turn on solr 8983
> 2)add,commit a doc x on solr 8983
> 3)turn off solr 8983
> 4)turn on solr 8984
> 5)shortly after (leader still not elected) turn on solr 8983
> 6)8984 is elected as leader
> 7)doc x is present on 8983 but not on 8984 (check issuing a query)
> In attachment are the solr.log files of both instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

