[
https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042621#comment-16042621
]
Andrzej Bialecki commented on SOLR-10704:
------------------------------------------
I can reproduce this with a 2-node setup as well, using an HDFS collection with
2 shards and 1 replica each. It appears that REPLACENODE deletes the original
replica (which is also the shard leader) while the new replica is still starting
its recovery - at which point the recovery fails. The new replica then tries to
find a shard leader, which no longer exists (see the reproduction sketch after
the log):
{code}
2017-06-08 12:11:49.760 INFO (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://192.168.0.201:8983/solr/gettingstarted_shard1_replica1/] - recoveringAfterStartup=[true]
2017-06-08 12:11:49.789 INFO (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.u.PeerSync PeerSync: core=gettingstarted_shard1_replica2 url=http://192.168.0.202:8983/solr START replicas=[http://192.168.0.201:8983/solr/gettingstarted_shard1_replica1/] nUpdates=100
2017-06-08 12:11:49.856 ERROR (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover. core=gettingstarted_shard1_replica2:java.lang.NullPointerException
	at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:340)
	at org.apache.solr.update.PeerSync.sync(PeerSync.java:223)
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
...
2017-06-08 12:12:03.826 INFO (zkCallback-4-thread-2-processing-n:192.168.0.202:8983_solr) [c:gettingstarted s:shard2 r:core_node4 x:gettingstarted_shard2_replica2] o.a.s.c.ActionThrottle The last leader attempt started 34ms ago.
2017-06-08 12:12:03.827 INFO (zkCallback-4-thread-2-processing-n:192.168.0.202:8983_solr) [c:gettingstarted s:shard2 r:core_node4 x:gettingstarted_shard2_replica2] o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 4965ms
2017-06-08 12:12:03.873 ERROR (recoveryExecutor-3-thread-1-processing-n:192.168.0.202:8983_solr x:gettingstarted_shard1_replica2 s:shard1 c:gettingstarted r:core_node3) [c:gettingstarted s:shard1 r:core_node3 x:gettingstarted_shard1_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover. core=gettingstarted_shard1_replica2:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: gettingstarted slice: shard1
	at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:747)
	at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:733)
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:305)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
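For reference, reproducing this takes nothing more than a single REPLACENODE
call against a collection whose replicationFactor is 1. A minimal sketch using
plain JDK HTTP - host, port and node names are illustrative, and note that 6.x
uses the source/target parameter names (later releases renamed them to
sourceNode/targetNode):
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReplaceNodeRepro {
  public static void main(String[] args) throws Exception {
    // Move every replica off the source node; with replicationFactor=1 the
    // only copy of each shard is deleted while its replacement still recovers.
    String url = "http://192.168.0.201:8983/solr/admin/collections"
        + "?action=REPLACENODE"
        + "&source=192.168.0.201:8983_solr"   // node hosting the only replica
        + "&target=192.168.0.202:8983_solr";  // node receiving the new replica
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // Collections API response (XML by default)
      }
    }
  }
}
{code}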
The main problem here is that the original replica is deleted before the new
replica has fully recovered - when a shard is down to a single active replica,
the code should wait for the new replica to finish recovery before deleting the
original. The same consideration applies when several replicas of the same
shard live on the node being replaced: their deletion has to wait until at
least one new replica has fully recovered.
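To make the intended fix concrete, here is a minimal sketch of such a guard
(not the actual patch - the helper name and the sleep-polling loop are mine,
and the real code would more likely use a ZK watch than polling):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplaceNodeGuard {

  /**
   * Hypothetical helper: block until some replica other than the one slated
   * for deletion is ACTIVE, so REPLACENODE never deletes the last live copy.
   */
  public static void waitForActiveReplacement(ZkStateReader reader,
      String collection, String shard, String oldReplicaName, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      DocCollection coll = reader.getClusterState().getCollectionOrNull(collection);
      Slice slice = (coll == null) ? null : coll.getSlice(shard);
      if (slice != null) {
        for (Replica r : slice.getReplicas()) {
          // Any ACTIVE replica other than the one about to be deleted means
          // the shard survives the deletion.
          if (!r.getName().equals(oldReplicaName)
              && r.getState() == Replica.State.ACTIVE) {
            return;
          }
        }
      }
      Thread.sleep(500); // crude polling; production code would watch ZK state
    }
    throw new TimeoutException(
        "No active replacement replica for " + collection + "/" + shard);
  }
}
{code}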
> REPLACENODE can cause data loss in collections whose replicationFactor is 1
> ----------------------------------------------------------------------------
>
> Key: SOLR-10704
> URL: https://issues.apache.org/jira/browse/SOLR-10704
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 6.2
> Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
> Reporter: Daisy.Yuan
> Assignee: Andrzej Bialecki
> Fix For: master (7.0), 6.7
>
> Attachments: 219.log
>
>
> When a collection's replicationFactor is 1, it can lose data after the
> REPLACENODE command is executed.
> The new replica on the target node may not have completed recovery when
> the old replica on the source node is already deleted.
> The recovery on the target node then fails with the following exception:
> 2017-05-18 17:08:48,587 | ERROR |
> recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
> x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1
> r:core_node3 | Error while trying to recover.
> core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
> at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)