[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408606#comment-16408606
 ] 

Varun Thacker commented on SOLR-12087:
--------------------------------------

Hi Dat,

Great catch!

 

A couple of minor comments about the patch:
 * The log.warn in ReplicaMutator: should we just remove it? A user reading 
the logs will likely be confused by it, and there is no action to take anyway. 
Maybe it could be a DEBUG log entry instead?
 * In DeleteReplicaTest, can we change {{e.printStackTrace();}} to log the 
exception through the logger?
 * Just curious: why is there a {{Thread.sleep(2000);}} wait in the test code?
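
To illustrate the last two points, here is a minimal sketch of what I have in mind. It assumes nothing about the actual test: {{TestWaitSketch}}, {{waitFor}} and {{runLogged}} are hypothetical names, and it uses java.util.logging only so that it is self-contained (the test itself would presumably keep using its existing logger). It shows a bounded poll in place of a fixed sleep, and an exception logged with its stack trace instead of printed to stderr.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of the patch or of Solr's test framework.
final class TestWaitSketch {
  private static final Logger log = Logger.getLogger(TestWaitSketch.class.getName());

  /** Polls the condition until it holds or the timeout elapses; returns whether it held. */
  static boolean waitFor(BooleanSupplier condition, long timeout, TimeUnit unit)
      throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out without the condition becoming true
      }
      Thread.sleep(10); // short poll interval instead of one fixed 2000 ms sleep
    }
    return true;
  }

  /** Starts a background thread and reports runtime failures through the logger. */
  static Thread runLogged(String name, Runnable task) {
    Thread t = new Thread(() -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        // Instead of e.printStackTrace(): message and stack trace go to the log output.
        log.log(Level.SEVERE, "Background task failed: " + name, e);
      }
    }, name);
    t.start();
    return t;
  }
}
```

With something like this, the test returns as soon as the expected state is reached and only fails after an explicit timeout, rather than always paying a fixed delay.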

 

 

For reference, I ran the test patch on master (which has the new LIR code) 
and all 10 runs passed.

A few things worth noting are these log lines:
{code:java}
7471 INFO (qtp1388245618-49) [n:127.0.0.1:56141_solr ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true&core=deleteReplicaOnIndexing_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2&deleteIndex=true}
 status=0 QTime=85
7559 INFO (qtp1216387063-400) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Successful update of terms at 
/collections/deleteReplicaOnIndexing/terms/shard1 to 
Terms{values={core_node4=1}, version=3}
7559 INFO (qtp1216387063-289) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7559 INFO (qtp1216387063-57) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7560 INFO (qtp1216387063-319) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7560 INFO (qtp1216387063-476) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7559 INFO (qtp1216387063-423) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7561 INFO (qtp1216387063-444) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7560 INFO (qtp1216387063-325) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7563 INFO (qtp1216387063-324) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
7561 INFO (qtp1216387063-321) [n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing 
s:shard1 r:core_node4 x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying

....

7705 ERROR 
(updateExecutor-16-thread-94-processing-https:////127.0.0.1:56141//solr//deleteReplicaOnIndexing_shard1_replica_n1
 x:deleteReplicaOnIndexing_shard1_replica_n2 r:core_node4 
n:127.0.0.1:56142_solr s:shard1 c:deleteReplicaOnIndexing) 
[n:127.0.0.1:56142_solr c:deleteReplicaOnIndexing s:shard1 r:core_node4 
x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at 
https://127.0.0.1:56141/solr/deleteReplicaOnIndexing_shard1_replica_n1: Can not 
find: /solr/deleteReplicaOnIndexing_shard1_replica_n1/update



request: 
https://127.0.0.1:56141/solr/deleteReplicaOnIndexing_shard1_replica_n1/update?update.distrib=FROMLEADER&distrib.from=https%3A%2F%2F127.0.0.1%3A56142%2Fsolr%2FdeleteReplicaOnIndexing_shard1_replica_n2%2F&wt=javabin&version=2

...

7751 WARN  (qtp1216387063-326) [n:127.0.0.1:56142_solr 
c:deleteReplicaOnIndexing s:shard1 r:core_node4 
x:deleteReplicaOnIndexing_shard1_replica_n2] 
o.a.s.u.p.DistributedUpdateProcessor Core core_node4 belonging to 
deleteReplicaOnIndexing shard1, does not have error'd node 
https://127.0.0.1:56141/solr/deleteReplicaOnIndexing_shard1_replica_n1/ as a 
replica. No request recovery command will be sent!{code}
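
The repeated "Failed to save terms, version is not a match, retrying" lines look like what you would expect from an optimistic, version-checked write under concurrent updates: each writer re-reads the current state and tries again when another writer got in first. Below is a minimal sketch of that general pattern. It is an assumption for illustration, not the actual ZkShardTerms code: {{VersionedTermsSketch}} and its methods are hypothetical, and an in-memory object stands in for the versioned ZooKeeper node.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical in-memory stand-in for a versioned ZooKeeper node; not Solr code.
final class VersionedTermsSketch {
  /** Immutable view of the stored values together with the version they were read at. */
  static final class Snapshot {
    final Map<String, Long> values;
    final int version;

    Snapshot(Map<String, Long> values, int version) {
      this.values = Collections.unmodifiableMap(new HashMap<>(values));
      this.version = version;
    }
  }

  private Snapshot current = new Snapshot(new HashMap<>(), 0);

  synchronized Snapshot read() {
    return current;
  }

  /** Writes only if the caller read the latest version; otherwise rejects the write. */
  synchronized boolean saveIfVersionMatches(Map<String, Long> newValues, int expectedVersion) {
    if (current.version != expectedVersion) {
      return false;
    }
    current = new Snapshot(newValues, expectedVersion + 1);
    return true;
  }

  /** Re-reads and re-applies the change until the versioned write succeeds; returns the retry count. */
  int update(UnaryOperator<Map<String, Long>> change) {
    int retries = 0;
    while (true) {
      Snapshot seen = read();
      Map<String, Long> next = change.apply(new HashMap<>(seen.values));
      if (saveIfVersionMatches(next, seen.version)) {
        return retries;
      }
      retries++; // analogous to one "version is not a match, retrying" log line
    }
  }
}
```

Under this pattern, a burst of retry messages from many request threads at the same timestamp is expected contention rather than an error, as long as every writer eventually succeeds.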

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-12087
>                 URL: https://issues.apache.org/jira/browse/SOLR-12087
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.2
>            Reporter: Jerry Bao
>            Priority: Critical
>         Attachments: SOLR-12087.patch, SOLR-12087.test.patch, Screen Shot 
> 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en masse; the 
> result is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core no longer exists.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue: when the MOVEREPLICA command is 
> issued, the new replica is created successfully, but the replica to be deleted 
> fails to be removed from state.json (the core is deleted, though) and we see 
> two kinds of log messages spammed:
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until it's removed from the state).
> My guess is there are two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
