[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407442#comment-16407442
 ] 

Cao Manh Dat edited comment on SOLR-12087 at 3/21/18 4:43 AM:
--------------------------------------------------------------

I'm not sure, but this could be a race condition between DeleteReplica and in-flight updates:
 * Many updates are arriving at the leader
 * The leader forwards these updates to replicaA
 * DeleteReplica is called for replicaA
 * Several of the updates sent to replicaA fail (because replicaA is closed)
 * The entry for replicaA is removed from {{state.json}}
 * The leader puts replicaA into LIR (leader-initiated recovery) by publishing replicaA's state as DOWN, which adds the entry for replicaA back to {{state.json}}
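
The sequence above can be replayed with a small stand-alone model. This is only a sketch under assumptions: the map stands in for the replica entries in the cluster state, and the class and method names ({{DeleteReplicaRace}}, {{publishDownUnguarded}}, {{publishDownGuarded}}) are hypothetical and not taken from Solr. It shows why an unconditional "publish DOWN" brings a deleted entry back, while a publish that only touches an existing entry does not. It does not claim to describe how any actual fix is implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DeleteReplicaRace {
    // Hypothetical stand-in for the replica entries of one shard in the cluster state.
    static final Map<String, String> state = new ConcurrentHashMap<>();

    // Models DeleteReplica: the replica's entry is dropped from the state.
    static void deleteReplica(String coreNodeName) {
        state.remove(coreNodeName);
    }

    // Unguarded publish: writes DOWN even if the entry is gone, which re-adds it.
    static void publishDownUnguarded(String coreNodeName) {
        state.put(coreNodeName, "down");
    }

    // Guarded publish: only updates an entry that is still present.
    static void publishDownGuarded(String coreNodeName) {
        state.computeIfPresent(coreNodeName, (name, old) -> "down");
    }

    // Replays "delete, then late publish" and reports whether the entry came back.
    static boolean replay(boolean guarded) {
        state.clear();
        state.put("replicaA", "active");
        deleteReplica("replicaA");
        if (guarded) {
            publishDownGuarded("replicaA");
        } else {
            publishDownUnguarded("replicaA");
        }
        return state.containsKey("replicaA");
    }

    public static void main(String[] args) {
        System.out.println("unguarded: entry resurrected = " + replay(false)); // true
        System.out.println("guarded:   entry resurrected = " + replay(true));  // false
    }
}
```

In this model the only difference between the two outcomes is whether the late publish checks that the entry still exists before writing.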

[~jerry.bao] If this is what you are hitting, there should be log entries like these:
 * On the replica node at time t: log.info(logid+" CLOSING SolrCore " + this);
 * On the leader node at time t+delta: log.warn("Leader is publishing core={} coreNodeName ={} state={} on behalf of un-reachable replica {}", replicaCoreName, replicaCoreNodeName, Replica.State.DOWN.toString(), replicaUrl);
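
To check for that ordering in collected logs, something like the following could be used. It is a rough sketch under assumptions: each relevant log line is assumed to start with a {{yyyy-MM-dd HH:mm:ss.SSS}} timestamp (adjust to your logging pattern), the class name {{LirLogCheck}} is hypothetical, and the sample lines in {{main}} are made up for illustration. It looks for the two messages quoted above and returns delta only when the leader's publish comes at or after the replica's close.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LirLogCheck {
    // Assumed timestamp prefix of each log line (23 characters); adjust to your log pattern.
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Timestamp of the first line containing the marker, if any.
    // Throws DateTimeParseException if a matching line does not start with the assumed timestamp.
    static Optional<LocalDateTime> firstMatch(List<String> lines, String marker) {
        for (String line : lines) {
            if (line.length() >= 23 && line.contains(marker)) {
                return Optional.of(LocalDateTime.parse(line.substring(0, 23), TS));
            }
        }
        return Optional.empty();
    }

    // Returns delta when the replica closed at t and the leader published at t+delta (delta >= 0).
    static Optional<Duration> raceDelta(List<String> replicaLog, List<String> leaderLog) {
        Optional<LocalDateTime> closed = firstMatch(replicaLog, "CLOSING SolrCore");
        Optional<LocalDateTime> published = firstMatch(leaderLog, "Leader is publishing core=");
        if (!closed.isPresent() || !published.isPresent()) {
            return Optional.empty();
        }
        Duration delta = Duration.between(closed.get(), published.get());
        return delta.isNegative() ? Optional.empty() : Optional.of(delta);
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> replicaLog = Arrays.asList(
            "2018-03-21 04:43:00.100 INFO  [core1]  CLOSING SolrCore example");
        List<String> leaderLog = Arrays.asList(
            "2018-03-21 04:43:00.350 WARN  Leader is publishing core=core1 coreNodeName =core_node1 "
                + "state=down on behalf of un-reachable replica http://host/solr/core1/");
        System.out.println(raceDelta(replicaLog, leaderLog).map(Duration::toMillis));
    }
}
```

An empty result means either message is missing or the publish happened before the close, in which case this race is probably not what you are seeing.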

BTW, the case above is fixed by SOLR-11702.


> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-12087
>                 URL: https://issues.apache.org/jira/browse/SOLR-12087
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.2
>            Reporter: Jerry Bao
>            Priority: Critical
>         Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en masse; the 
> result is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read: every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to ZooKeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until it's removed from the state).
> My guess is there are two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
