[ 
https://issues.apache.org/jira/browse/SOLR-15288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311950#comment-17311950
 ] 

Ishan Chattopadhyaya commented on SOLR-15288:
---------------------------------------------

After Noble and I banged our heads on this for some time (and observed weird 
patterns of behaviour from Solr), we figured out the following:

# This problem manifests itself only in dev mode ("ant server; bin/solr 
-c -p XXX"). If the nodes are started from different directories (as one would 
do in production), the error does not occur.
# This error is present even without any of the PRS changes. As an example, 
repeating all the steps above with "perReplicaState=false" results in the same 
problem.

We had tested the PRS feature using a stress test suite that spun up Solr nodes 
all from different directories, and hence we never encountered this. The 
problem is bizarre enough that one of its symptoms is that stopping one node 
causes a DOWNNODE message to be generated by one of the live nodes.
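
To watch for these symptoms while repeating the steps, one option is to count the replicas reported as down in a Collections API CLUSTERSTATUS response. The following is only a minimal sketch: count_down_replicas is a hypothetical helper (not a Solr command), it uses a plain text match rather than a JSON parser, and it assumes that replica states appear in the response as "state":"down" entries.

```shell
# Hypothetical helper (not part of Solr): counts "state":"down" entries in a
# CLUSTERSTATUS JSON response read from stdin. This is a plain text match,
# not a JSON parser, so treat the result as a rough indicator only.
count_down_replicas() {
  # grep -o prints each match on its own line; "|| true" keeps the pipeline
  # from failing when there are no matches; tr strips wc's padding.
  { grep -o '"state" *: *"down"' || true; } | wc -l | tr -d ' '
}

# Usage against a running node (assumes the node on port 9000 and the
# collection name "mycoll" from the reproduce script in this issue):
# curl -s "http://localhost:9000/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll" | count_down_replicas
```

Running the commented curl line after each stop and start of the node on port 9001 should show whether the count changes as expected.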

As part of this testing, we added some more unit tests and some defensive 
fixes, all of which are useful (though likely not strictly necessary). We can 
add those tests and fixes here, and deal with the broader (but non-critical) 
issue of fixing the dev testing workflow that causes such problems in a 
separate JIRA issue.

> PRS replicas stay DOWN after a new node is restarted
> ----------------------------------------------------
>
>                 Key: SOLR-15288
>                 URL: https://issues.apache.org/jira/browse/SOLR-15288
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.8.1
>            Reporter: Ishan Chattopadhyaya
>            Priority: Critical
>
> After a PRS collection is created using a single node cluster, and a new node 
> is added and a replica for that collection is placed on the new node, 
> restarting that new node causes problems with replica states.
> Reproduce script:
> {code}
> # Start a fresh ZK on 2181
> # docker container prune -f && docker run -it -p 2181:2181 --name=zk1 -h zk1 zookeeper:3.5.6
> rm -rf server/logs/*
> bin/solr stop -all
> rm -rf server/solr/mycoll_shard1_replica_n1/ server/solr/mycoll_shard1_replica_n3/
> bin/solr -c -p 9000 -z localhost:2181
> curl "http://localhost:9000/solr/admin/collections?action=CREATE&name=mycoll&numShards=1&perReplicaState=true"
> bin/solr -c -p 9001 -z localhost:2181
> curl "http://localhost:9000/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1"
> bin/solr stop -p 9001
> bin/solr -c -p 9001 -z localhost:2181
> {code}
> Two problems:
> 1. Both replicas are now DOWN.
> 2. Also, as [~hitesh.khamesra] found out, the second replica stays ACTIVE 
> (not DOWN) after the second node (9001) is stopped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
