[
https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042621#comment-16042621
]
Andrzej Bialecki commented on SOLR-10704:
------------------------------------------
I can reproduce this with a 2-node setup as well, using an HDFS collection with
2 shards and 1 replica each. It appears that REPLACENODE deletes the original
replica (which is also the shard leader) while the new replica is still starting
its recovery - at which point the recovery fails. The new replica then tries to
find a shard leader, which no longer exists (see the reproduction sketch after
the log):
{code}
2017-06-08 12:11:49.760 INFO (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://192.168.0.201:8983/solr/gettingstarted_shard1_replica1/] - recoveringAfterStartup=[true]
2017-06-08 12:11:49.789 INFO (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: core=gettingstarted_shard1_replica2 url=http://192.168.0.202:8983/solr START replicas=[http://192.168.0.201:8983/solr/gettingstarted_shard1_replica1/] nUpdates=100
2017-06-08 12:11:49.856 ERROR (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover. core=gettingstarted_shard1_replica2:java.lang.NullPointerException
	at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:340)
	at org.apache.solr.update.PeerSync.sync(PeerSync.java:223)
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
...
2017-06-08 12:12:03.826 INFO (zkCallback-4-thread-2-processing-n:192.168.0.202:8983_solr) [c:gettingstarted s:shard2 r:core_node4 x:gettingstarted_shard2_replica2] o.a.s.c.ActionThrottle The last leader attempt started 34ms ago.
2017-06-08 12:12:03.827 INFO (zkCallback-4-thread-2-processing-n:192.168.0.202:8983_solr) [c:gettingstarted s:shard2 r:core_node4 x:gettingstarted_shard2_replica2] o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 4965ms
2017-06-08 12:12:03.873 ERROR (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover. core=gettingstarted_shard1_replica2:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: gettingstarted slice: shard1
	at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:747)
	at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:733)
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:305)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
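For reference, reproducing this takes nothing more than a single REPLACENODE
call against a collection whose replicationFactor is 1. A minimal sketch using
plain JDK HTTP - host, port and node names are illustrative, and note that 6.x
uses the source/target parameter names (later releases renamed them to
sourceNode/targetNode):
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReplaceNodeRepro {
  public static void main(String[] args) throws Exception {
    // Move every replica off the source node; with replicationFactor=1 the
    // only copy of each shard is deleted while its replacement still recovers.
    String url = "http://192.168.0.201:8983/solr/admin/collections"
        + "?action=REPLACENODE"
        + "&source=192.168.0.201:8983_solr"   // node hosting the only replica
        + "&target=192.168.0.202:8983_solr";  // node receiving the new replica
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // Collections API response (XML by default)
      }
    }
  }
}
{code}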
The main problem here is that the original replica is deleted before the new
replica has fully recovered - when a shard is down to a single active replica,
the code should wait for the new replica to finish recovery before deleting the
original. The same consideration applies when several replicas of the same
shard live on the node being replaced: their deletion has to wait until at
least one new replica has fully recovered.
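To make the intended fix concrete, here is a minimal sketch of such a guard
(not the actual patch - the helper name and the sleep-polling loop are mine,
and the real code would more likely use a ZK watch than polling):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplaceNodeGuard {

  /**
   * Hypothetical helper: block until some replica other than the one slated
   * for deletion is ACTIVE, so REPLACENODE never deletes the last live copy.
   */
  public static void waitForActiveReplacement(ZkStateReader reader,
      String collection, String shard, String oldReplicaName, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      DocCollection coll = reader.getClusterState().getCollectionOrNull(collection);
      Slice slice = (coll == null) ? null : coll.getSlice(shard);
      if (slice != null) {
        for (Replica r : slice.getReplicas()) {
          // Any ACTIVE replica other than the one about to be deleted means
          // the shard survives the deletion.
          if (!r.getName().equals(oldReplicaName)
              && r.getState() == Replica.State.ACTIVE) {
            return;
          }
        }
      }
      Thread.sleep(500); // crude polling; production code would watch ZK state
    }
    throw new TimeoutException(
        "No active replacement replica for " + collection + "/" + shard);
  }
}
{code}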
> REPLACENODE can cause data loss in collections whose replicationFactor is 1
> ----------------------------------------------------------------------------
>
> Key: SOLR-10704
> URL: https://issues.apache.org/jira/browse/SOLR-10704
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 6.2
> Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
> Reporter: Daisy.Yuan
> Assignee: Andrzej Bialecki
> Fix For: master (7.0), 6.7
>
> Attachments: 219.log
>
>
> When a collection's replicationFactor is 1, it can lose data after the
> REPLACENODE command is executed.
> The new replica on the target node may not have completed recovery when
> the old replica on the source node is already deleted.
> The recovery on the target node then fails with the following exception:
> 2017-05-18 17:08:48,587 | ERROR |
> recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
> x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1
> r:core_node3 | Error while trying to recover.
> core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
> at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)