[
https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daisy.Yuan updated SOLR-10704:
------------------------------
Description:
When a collection's replicationFactor is 1, executing the REPLACENODE command
can lose data.
The likely cause is that the new replica on the target node has not finished
recovering, but the old replica on the source node has already been deleted.
The target's recovery then fails with the following exception:
2017-05-18 17:08:48,587 | ERROR |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Error while trying to recover.
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
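For reference, here is a minimal sketch of the Collections API call that triggers this scenario, using the same action/source/target parameters that the CollectionsHandler log below records. The base URL is an assumption (any live Solr node accepts Collections API requests).
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/** Issues the REPLACENODE request described in this report. */
public class ReplaceNodeRepro {
    public static void main(String[] args) throws Exception {
        String solrBase = "http://192.168.228.193:21100/solr";  // any live node (assumed)
        String source   = "192.168.229.219:21100_solr";         // node being replaced
        String target   = "192.168.229.137:21103_solr";         // node receiving the new replicas

        URL url = new URL(solrBase + "/admin/collections"
                + "?action=REPLACENODE"
                + "&source=" + source
                + "&target=" + target
                + "&wt=json");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // prints the JSON response from the Collections API
            }
        }
    }
}
{code}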
was:
1. Collection and replica distribution:
replace-hdfs-coll1 has two shards; each shard has one replica, and the index
files are stored on HDFS.
replace-hdfs-coll1_shard1_replica1 on node 192.168.229.219
replace-hdfs-coll1_shard2_replica1 on node 192.168.228.193
replace-hdfs-coll2 has two shards; each shard has two replicas, and the index
files are stored on HDFS.
replace-hdfs-coll2_shard1_replica1 on node 192.168.229.219
replace-hdfs-coll2_shard1_replica2 on node 192.168.229.193
replace-hdfs-coll2_shard2_replica1 on node 192.168.228.193
replace-hdfs-coll2_shard2_replica2 on node 192.168.229.219
replace-local-coll1 has two shards; each shard has one replica, and the index
files are stored on local disk.
replace-local-coll1_shard1_replica1 on node 192.168.228.193
replace-local-coll1_shard2_replica1 on node 192.168.229.219
replace-local-coll2 has two shards; each shard has two replicas, and the index
files are stored on local disk.
replace-local-coll2_shard1_replica1 on node 192.168.229.193
replace-local-coll2_shard1_replica2 on node 192.168.229.219
replace-local-coll2_shard2_replica1 on node 192.168.228.193
replace-local-coll2_shard2_replica2 on node 192.168.229.219
2. Execute REPLACENODE to replace node 192.168.229.219 with node 192.168.229.137
3. The REPLACENODE request was executed successfully
4. The target replace-hdfs-coll1_shard1_replica2 has not finished recovering,
but the source replace-hdfs-coll1_shard1_replica1 has already been deleted.
The target's recovery then fails with the following exception (a safer manual
ordering is sketched after the log excerpts below):
2017-05-18 17:08:48,587 | ERROR |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Error while trying to recover.
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
5. The main log messages during the process
(from 193.log; node 192.168.228.193 holds the overseer role)
step 1. Node 192.168.229.193 received the REPLACENODE request
2017-05-18 17:08:32,717 | INFO | http-nio-21100-exec-6 | Invoked Collection
Action :replacenode with params
action=REPLACENODE&source=192.168.229.219:21100_solr&wt=json&target=192.168.229.137:21103_solr
and sendToOCPQueue=true |
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:203)
step 2. The OverseerCollectionConfigSetProcessor picks up the task message and
processes the REPLACENODE operation
step 3. Add the new replica
2017-05-18 17:08:36,592 | INFO |
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063
| processMessage: queueSize: 1, message = {
"core":"replace-hdfs-coll1_shard1_replica2",
"roles":null,
"base_url":"http://192.168.229.137:21103/solr",
"node_name":"192.168.229.137:21103_solr",
"state":"down",
"shard":"shard1",
"collection":"replace-hdfs-coll1",
"operation":"state"} current state version: 42 |
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
2017-05-18 17:08:40,540 | INFO |
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063
| processMessage: queueSize: 1, message = {
"core":"replace-hdfs-coll1_shard1_replica2",
"core_node_name":"core_node3",
"dataDir":"hdfs://hacluster//user/solr//SolrServer1/replace-hdfs-coll1/core_node3/data/",
"roles":null,
"base_url":"http://192.168.229.137:21103/solr",
"node_name":"192.168.229.137:21103_solr",
"state":"recovering",
"shard":"shard1",
"collection":"replace-hdfs-coll1",
"operation":"state",
"ulogDir":"hdfs://hacluster/user/solr/SolrServer1/replace-hdfs-coll1/core_node3/data/tlog"}
current state version: 42 |
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
step 4. deletecore
2017-05-18 17:08:47,552 | INFO |
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063
| processMessage: queueSize: 1, message = {
"operation":"deletecore",
"core":"replace-hdfs-coll1_shard1_replica1",
"node_name":"192.168.229.219:21100_solr",
"collection":"replace-hdfs-coll1",
"core_node_name":"core_node2"} current state version: 42 |
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
192.168.229.219 is the source node.
2017-05-18 17:08:47,484 | INFO | http-nio-21100-exec-6 | Removing directory
before core close:
hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data/index
|
org.apache.solr.core.CachingDirectoryFactory.closeCacheValue(CachingDirectoryFactory.java:271)
2017-05-18 17:08:47,515 | INFO | http-nio-21100-exec-6 | Removing directory
after core close:
hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data
|
org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:204)
192.168.229.137 is the target node, but replace-hdfs-coll1_shard1_replica2
has not finished recovering:
2017-05-18 17:08:48,547 | INFO |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Attempting to PeerSync from
[http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/] -
recoveringAfterStartup=[true] |
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:370)
2017-05-18 17:08:48,547 | INFO |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| PeerSync: core=replace-hdfs-coll1_shard1_replica2
url=http://192.168.229.137:21103/solr START
replicas=[http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/]
nUpdates=100 | org.apache.solr.update.PeerSync.sync(PeerSync.java:214)
2017-05-18 17:08:48,587 | ERROR |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Error while trying to recover.
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
at org.apache.solr.update.PeerSync.sync(PeerSync.java:222)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
| org.apache.solr.common.SolrException.log(SolrException.java:159)
2017-05-18 17:08:48,587 | INFO |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Replay not started, or was not successful... still buffering updates. |
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:441)
2017-05-18 17:08:48,587 | ERROR |
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3
| Recovery failed - trying again... (0) |
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:478)
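Until the ordering inside REPLACENODE is fixed, the sketch below shows the manual sequence it effectively performs, but with an explicit wait for the new replica to become active before the old one is deleted. The collection, shard, node, and core_node names are the ones from this report; the CLUSTERSTATUS check is a crude string match for brevity (a real script would parse the JSON and check the specific replica).
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Add the new replica, wait until it is active, and only then delete the old
 * one -- the ordering that avoids the data loss described in step 4.
 */
public class SafeReplaceSketch {

    static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String solr = "http://192.168.228.193:21100/solr/admin/collections";

        // 1. Add a replica of shard1 on the target node.
        get(solr + "?action=ADDREPLICA&collection=replace-hdfs-coll1"
                 + "&shard=shard1&node=192.168.229.137:21103_solr&wt=json");

        // 2. Poll CLUSTERSTATUS until the target node appears for the collection
        //    and an "active" state is reported (crude string check, see note above).
        while (true) {
            String status = get(solr
                    + "?action=CLUSTERSTATUS&collection=replace-hdfs-coll1&wt=json");
            if (status.contains("192.168.229.137:21103_solr")
                    && status.contains("\"state\":\"active\"")) {
                break;
            }
            Thread.sleep(5000);
        }

        // 3. Only now delete the old replica (core_node2 in the deletecore log above).
        get(solr + "?action=DELETEREPLICA&collection=replace-hdfs-coll1"
                 + "&shard=shard1&replica=core_node2&wt=json");
    }
}
{code}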
> REPLACENODE can make a collection with one replica lose data
> ------------------------------------------------------------
>
> Key: SOLR-10704
> URL: https://issues.apache.org/jira/browse/SOLR-10704
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 6.2
> Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
> Reporter: Daisy.Yuan
> Attachments: 219.log
>
>
> When a collection's replicationFactor is 1, executing the REPLACENODE
> command can lose data.
> The likely cause is that the new replica on the target node has not finished
> recovering, but the old replica on the source node has already been deleted.
> The target's recovery then fails with the following exception:
> 2017-05-18 17:08:48,587 | ERROR |
> recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr
> x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1
> r:core_node3 | Error while trying to recover.
> core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
> at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)