[
https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018369#comment-16018369
]
Daisy.Yuan commented on SOLR-10704:
-----------------------------------
The details:
1. Collections' replica distribution (a SolrJ sketch that recreates this layout follows the list):
replace-hdfs-coll1 has two shards, each shard has one replica, and the index
files are stored on HDFS.
replace-hdfs-coll1_shard1_replica1 on node 192.168.229.219
replace-hdfs-coll1_shard2_replica1 on node 192.168.228.193
replace-hdfs-coll2 has two shards, each shard has two replicas, and the index
files are stored on HDFS.
replace-hdfs-coll2_shard1_replica1 on node 192.168.229.219
replace-hdfs-coll2_shard1_replica2 on node 192.168.229.193
replace-hdfs-coll2_shard2_replica1 on node 192.168.228.193
replace-hdfs-coll2_shard2_replica2 on node 192.168.229.219
replace-local-coll1 has two shards, each shard has one replica, and the index
files are stored on local disk.
replace-local-coll1_shard1_replica1 on node 192.168.228.193
replace-local-coll1_shard2_replica1 on node 192.168.229.219
replace-local-coll2 has two shards, each shard has two replicas, and the index
files are stored on local disk.
replace-local-coll2_shard1_replica1 on node 192.168.229.193
replace-local-coll2_shard1_replica2 on node 192.168.229.219
replace-local-coll2_shard2_replica1 on node 192.168.228.193
replace-local-coll2_shard2_replica2 on node 192.168.229.219
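For reference, a minimal SolrJ sketch that recreates this layout; the config set names and the ZooKeeper address are my placeholders, not taken from this report:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateTestCollections {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("192.168.228.193:2181/solr").build()) {   // hypothetical ZK address
      // Two shards, replicationFactor=1: the layout that loses data below.
      CollectionAdminRequest.createCollection("replace-hdfs-coll1", "hdfsConf", 2, 1).process(client);
      CollectionAdminRequest.createCollection("replace-hdfs-coll2", "hdfsConf", 2, 2).process(client);
      CollectionAdminRequest.createCollection("replace-local-coll1", "localConf", 2, 1).process(client);
      CollectionAdminRequest.createCollection("replace-local-coll2", "localConf", 2, 2).process(client);
    }
  }
}

(Exact replica placement as listed above would additionally need the createNodeSet parameter; the sketch only reproduces the shard and replica counts.)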
2. Executed REPLACENODE to replace node 192.168.229.219 with node 192.168.229.137.
3. The REPLACENODE request was reported as executed successfully.
4. The target replica replace-hdfs-coll1_shard1_replica2 had not finished
recovering, but the source replica replace-hdfs-coll1_shard1_replica1 had
already been deleted. In the end the target's recovery failed with the
following exception, likely because PeerSync asked the now-deleted source
core for data it could no longer provide:
2017-05-18 17:08:48,587 | ERROR | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Error while trying to recover. core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
    at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
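The sequence above points at the missing safety check: REPLACENODE should not delete the source replica until the target replica has gone ACTIVE. A minimal sketch of such a guard using public SolrJ APIs; the method name and the polling/timeout handling are mine, not the actual ReplaceNodeCmd code:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class ReplicaGuard {
  /** Returns true once the replica is ACTIVE; assumes client.connect() was called. */
  static boolean waitForActive(CloudSolrClient client, String collection,
                               String coreNodeName, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      DocCollection coll = client.getZkStateReader().getClusterState().getCollection(collection);
      Replica replica = coll.getReplica(coreNodeName);
      if (replica != null && replica.getState() == Replica.State.ACTIVE) return true;
      Thread.sleep(1000);   // re-read the cluster state once per second
    }
    return false;   // timed out: the source replica must NOT be deleted yet
  }
}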
5. The main log messages in the process
The following entries are from 193.log; node 192.168.228.193 holds the overseer role.
step 1. node 192.168.229.193 received the REPLACENODE request
2017-05-18 17:08:32,717 | INFO | http-nio-21100-exec-6 | Invoked Collection Action :replacenode with params action=REPLACENODE&source=192.168.229.219:21100_solr&wt=json&target=192.168.229.137:21103_solr and sendToOCPQueue=true | org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:203)
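The same request issued from SolrJ, as a sketch (any live node's base URL will accept Collections API calls):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReplaceNodeCall {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "REPLACENODE");
    params.set("source", "192.168.229.219:21100_solr");
    params.set("target", "192.168.229.137:21103_solr");
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://192.168.228.193:21100/solr").build()) {
      System.out.println(client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params)));
    }
  }
}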
step 2. the OverseerCollectionConfigSetProcessor picks up the task message
and processes the REPLACENODE
step 3. add the new replica on the target node
2017-05-18 17:08:36,592 | INFO | OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 | processMessage: queueSize: 1, message = {
  "core":"replace-hdfs-coll1_shard1_replica2",
  "roles":null,
  "base_url":"http://192.168.229.137:21103/solr",
  "node_name":"192.168.229.137:21103_solr",
  "state":"down",
  "shard":"shard1",
  "collection":"replace-hdfs-coll1",
  "operation":"state"} current state version: 42 | org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
2017-05-18 17:08:40,540 | INFO | OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 | processMessage: queueSize: 1, message = {
  "core":"replace-hdfs-coll1_shard1_replica2",
  "core_node_name":"core_node3",
  "dataDir":"hdfs://hacluster//user/solr//SolrServer1/replace-hdfs-coll1/core_node3/data/",
  "roles":null,
  "base_url":"http://192.168.229.137:21103/solr",
  "node_name":"192.168.229.137:21103_solr",
  "state":"recovering",
  "shard":"shard1",
  "collection":"replace-hdfs-coll1",
  "operation":"state",
  "ulogDir":"hdfs://hacluster/user/solr/SolrServer1/replace-hdfs-coll1/core_node3/data/tlog"} current state version: 42 | org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
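Step 3 corresponds to an ADDREPLICA placed on the target node; REPLACENODE drives this internally. A hedged standalone equivalent (the ZooKeeper address is a placeholder):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AddTargetReplica {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("192.168.228.193:2181/solr").build()) {
      // Place a new replica of shard1 on the target node, like step 3 above.
      CollectionAdminRequest.AddReplica addReplica =
          CollectionAdminRequest.addReplicaToShard("replace-hdfs-coll1", "shard1");
      addReplica.setNode("192.168.229.137:21103_solr");
      addReplica.process(client);
    }
  }
}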
step 4. deletecore (remove the old core from the source node)
2017-05-18 17:08:47,552 | INFO | OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 | processMessage: queueSize: 1, message = {
  "operation":"deletecore",
  "core":"replace-hdfs-coll1_shard1_replica1",
  "node_name":"192.168.229.219:21100_solr",
  "collection":"replace-hdfs-coll1",
  "core_node_name":"core_node2"} current state version: 42 | org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
On the source node, 192.168.229.219:
2017-05-18 17:08:47,484 | INFO | http-nio-21100-exec-6 | Removing directory before core close: hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data/index | org.apache.solr.core.CachingDirectoryFactory.closeCacheValue(CachingDirectoryFactory.java:271)
2017-05-18 17:08:47,515 | INFO | http-nio-21100-exec-6 | Removing directory after core close: hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data | org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:204)
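This deletion amounts to unloading the core and removing its data directory. A standalone SolrJ sketch of the equivalent call (illustrative only; the Overseer takes its own internal code path):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class DeleteSourceCore {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://192.168.229.219:21100/solr").build()) {
      CoreAdminRequest.Unload unload = new CoreAdminRequest.Unload(true); // deleteIndex=true
      unload.setCoreName("replace-hdfs-coll1_shard1_replica1");
      unload.setDeleteDataDir(true);      // removes .../core_node2/data, as logged above
      unload.setDeleteInstanceDir(true);
      client.request(unload);
    }
  }
}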
On the target node, 192.168.229.137, recovery of replace-hdfs-coll1_shard1_replica2
has not yet finished:
2017-05-18 17:08:48,547 | INFO | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Attempting to PeerSync from [http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/] - recoveringAfterStartup=[true] | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:370)
2017-05-18 17:08:48,547 | INFO | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | PeerSync: core=replace-hdfs-coll1_shard1_replica2 url=http://192.168.229.137:21103/solr START replicas=[http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/] nUpdates=100 | org.apache.solr.update.PeerSync.sync(PeerSync.java:214)
2017-05-18 17:08:48,587 | ERROR | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Error while trying to recover. core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
    at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
    at org.apache.solr.update.PeerSync.sync(PeerSync.java:222)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
| org.apache.solr.common.SolrException.log(SolrException.java:159)
2017-05-18 17:08:48,587 | INFO | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Replay not started, or was not successful... still buffering updates. | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:441)
2017-05-18 17:08:48,587 | ERROR | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Recovery failed - trying again... (0) | org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:478)
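The NPE location suggests that PeerSync.alreadyInSync dereferences an index fingerprint that the already-deleted source replica never returned. A purely illustrative null guard, not the actual PeerSync code:

import java.util.Map;
import java.util.Objects;

public class FingerprintGuard {
  // Compare our index fingerprint with the peer's; treat a missing peer
  // fingerprint (e.g. the peer core was just deleted) as "not in sync" so the
  // caller falls back to replication instead of throwing NullPointerException.
  static boolean alreadyInSync(Map<String, Object> ourResponse,
                               Map<String, Object> peerResponse) {
    Object ours = ourResponse.get("fingerprint");
    Object theirs = peerResponse.get("fingerprint");  // null when the peer is gone
    if (ours == null || theirs == null) return false;
    return Objects.equals(ours, theirs);
  }
}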
> REPLACENODE can make the collection with one replica lost data
> --------------------------------------------------------------
>
> Key: SOLR-10704
> URL: https://issues.apache.org/jira/browse/SOLR-10704
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 6.2
> Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
> Reporter: Daisy.Yuan
> Attachments: 219.log
>
>
> When a collection's replicationFactor is 1, its data can be lost after
> executing the REPLACENODE command.
> The new replica on the target node may not have completed recovery when
> the old replica on the source node is deleted.
> In the end the target's recovery fails with the following exception:
> 2017-05-18 17:08:48,587 | ERROR | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Error while trying to recover. core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
>     at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)