[ https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daisy.Yuan updated SOLR-10704:
------------------------------
    Description: 
When a collection has replicationFactor=1, executing the REPLACENODE command 
can lose data.

The likely cause is that the new replica on the target node has not finished 
recovering when the old replica on the source node is already deleted.

The target replica's recovery then fails with the following exception:
2017-05-18 17:08:48,587 | ERROR | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Error while trying to recover. 
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
        at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
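
The race would be avoided if the source replica were only removed once the new 
core on the target node reports ACTIVE. A minimal sketch of such a guard, 
assuming SolrJ's CloudSolrClient/ZkStateReader (an illustration only, not the 
actual REPLACENODE implementation):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ReplaceNodeGuard {
  // Hypothetical helper: poll the cluster state until every replica of the
  // collection that lives on targetNode is ACTIVE, or the timeout expires.
  static boolean waitForActiveOnNode(CloudSolrClient client, String collection,
                                     String targetNode, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      ClusterState cs = client.getZkStateReader().getClusterState();
      DocCollection coll = cs.getCollection(collection);
      boolean allActive = true;
      for (Slice slice : coll.getSlices()) {
        for (Replica replica : slice.getReplicas()) {
          if (targetNode.equals(replica.getNodeName())
              && replica.getState() != Replica.State.ACTIVE) {
            allActive = false;   // new core is still down or recovering
          }
        }
      }
      if (allActive) return true;   // only now is it safe to delete the source
      Thread.sleep(500);
    }
    return false;                   // timed out: keep the source replica
  }
}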




  was:
1. Replica distribution of the collections:
replace-hdfs-coll1 has two shards, each with one replica; the index files are 
stored on HDFS.
replace-hdfs-coll1_shard1_replica1  on node 192.168.229.219
replace-hdfs-coll1_shard2_replica1  on node 192.168.228.193

replace-hdfs-coll2 has two shards, each with two replicas; the index files are 
stored on HDFS.
replace-hdfs-coll2_shard1_replica1  on node 192.168.229.219
replace-hdfs-coll2_shard1_replica2  on node 192.168.229.193

replace-hdfs-coll2_shard2_replica1  on node 192.168.228.193
replace-hdfs-coll2_shard2_replica2  on node 192.168.229.219

replace-local-coll1 has two shards, each with one replica; the index files are 
stored on local disk.
replace-local-coll1_shard1_replica1  on node 192.168.228.193
replace-local-coll1_shard2_replica1  on node 192.168.229.219

replace-local-coll2 has two shards, each with two replicas; the index files are 
stored on local disk.
replace-local-coll2_shard1_replica1  on node 192.168.229.193
replace-local-coll2_shard1_replica2 on node 192.168.229.219

replace-local-coll2_shard2_replica1  on node 192.168.228.193
replace-local-coll2_shard2_replica2  on node 192.168.229.219
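
For reference, a sketch of how the test collections above could be created with 
SolrJ (the ZooKeeper address and config set name are placeholders; the 
HDFS-backed collections assume a config set whose directoryFactory is 
HdfsDirectoryFactory):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateTestCollections {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      // replicationFactor=1: the case that loses data on REPLACENODE.
      CollectionAdminRequest
          .createCollection("replace-hdfs-coll1", "hdfsConfig", 2, 1)
          .process(client);
      // replicationFactor=2: a second copy of each shard survives the replace.
      CollectionAdminRequest
          .createCollection("replace-hdfs-coll2", "hdfsConfig", 2, 2)
          .process(client);
    }
  }
}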

2. Execute REPLACENODE to replace node 192.168.229.219 with node 192.168.229.137

3. The REPLACENODE request was executed successfully
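
Issued from SolrJ, the call in step 2 corresponds to the parameters that appear 
in the CollectionsHandler log of step 1 below; a sketch using GenericSolrRequest 
(the ZooKeeper address is a placeholder):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReplaceNodeCall {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "REPLACENODE");
      params.set("source", "192.168.229.219:21100_solr");   // node being retired
      params.set("target", "192.168.229.137:21103_solr");   // node taking over
      // Same Collections API call as the one logged by CollectionsHandler.
      new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params)
          .process(client);
    }
  }
}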

4. The target replica replace-hdfs-coll1_shard1_replica2 had not finished 
recovering, but the source replica replace-hdfs-coll1_shard1_replica1 had 
already been deleted. The target's recovery then failed with the following 
exception:
2017-05-18 17:08:48,587 | ERROR | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Error while trying to recover. 
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
        at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)

5. The main log messages of the process
The log is from 193.log; node 192.168.228.193 has the Overseer role.
step 1. node 192.168.229.193 received the REPLACENODE request
2017-05-18 17:08:32,717 | INFO  | http-nio-21100-exec-6 | Invoked Collection 
Action :replacenode with params 
action=REPLACENODE&source=192.168.229.219:21100_solr&wt=json&target=192.168.229.137:21103_solr
 and sendToOCPQueue=true | 
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:203)

step 2. OverseerCollectionConfigSetProcessor picks up the task message and 
processes REPLACENODE

step 3.  add replica
2017-05-18 17:08:36,592 | INFO  | 
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 
| processMessage: queueSize: 1, message = {
  "core":"replace-hdfs-coll1_shard1_replica2",
  "roles":null,
  "base_url":"http://192.168.229.137:21103/solr";,
  "node_name":"192.168.229.137:21103_solr",
  "state":"down",
  "shard":"shard1",
  "collection":"replace-hdfs-coll1",
  "operation":"state"} current state version: 42 | 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
  
 2017-05-18 17:08:40,540 | INFO  | 
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 
| processMessage: queueSize: 1, message = {
  "core":"replace-hdfs-coll1_shard1_replica2",
  "core_node_name":"core_node3",
  
"dataDir":"hdfs://hacluster//user/solr//SolrServer1/replace-hdfs-coll1/core_node3/data/",
  "roles":null,
  "base_url":"http://192.168.229.137:21103/solr";,
  "node_name":"192.168.229.137:21103_solr",
  "state":"recovering",
  "shard":"shard1",
  "collection":"replace-hdfs-coll1",
  "operation":"state",
  
"ulogDir":"hdfs://hacluster/user/solr/SolrServer1/replace-hdfs-coll1/core_node3/data/tlog"}
 current state version: 42 | 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
 step 4.  deletecore
2017-05-18 17:08:47,552 | INFO  | 
OverseerStateUpdate-1225069473835599708-192.168.228.193:21100_solr-n_0000000063 
| processMessage: queueSize: 1, message = {
  "operation":"deletecore",
  "core":"replace-hdfs-coll1_shard1_replica1",
  "node_name":"192.168.229.219:21100_solr",
  "collection":"replace-hdfs-coll1",
  "core_node_name":"core_node2"} current state version: 42 | 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:221)
  
 192.168.229.219 is the source node.
  2017-05-18 17:08:47,484 | INFO  | http-nio-21100-exec-6 | Removing directory 
before core close: 
hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data/index
 | 
org.apache.solr.core.CachingDirectoryFactory.closeCacheValue(CachingDirectoryFactory.java:271)
2017-05-18 17:08:47,515 | INFO  | http-nio-21100-exec-6 | Removing directory 
after core close: 
hdfs://hacluster//user/solr//SolrServerAdmin/replace-hdfs-coll1/core_node2/data 
| 
org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:204)

 192.168.229.137 is the target node, but replace-hdfs-coll1_shard1_replica2 
has not finished recovering:
 2017-05-18 17:08:48,547 | INFO  | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Attempting to PeerSync from 
[http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/] - 
recoveringAfterStartup=[true] | 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:370)
2017-05-18 17:08:48,547 | INFO  | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| PeerSync: core=replace-hdfs-coll1_shard1_replica2 
url=http://192.168.229.137:21103/solr START 
replicas=[http://192.168.229.219:21100/solr/replace-hdfs-coll1_shard1_replica1/]
 nUpdates=100 | org.apache.solr.update.PeerSync.sync(PeerSync.java:214)
2017-05-18 17:08:48,587 | ERROR | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Error while trying to recover. 
core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
        at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
        at org.apache.solr.update.PeerSync.sync(PeerSync.java:222)
        at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:376)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
 | org.apache.solr.common.SolrException.log(SolrException.java:159)

2017-05-18 17:08:48,587 | INFO  | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Replay not started, or was not successful... still buffering updates. | 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:441)
2017-05-18 17:08:48,587 | ERROR | 
recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 
| Recovery failed - trying again... (0) | 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:478)
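
The NullPointerException at PeerSync.alreadyInSync(PeerSync.java:339) is 
consistent with the peer's index fingerprint being unavailable because the 
source core has already been deleted. A defensive sketch of that comparison 
(hypothetical names, not Solr's actual code) would treat a missing peer 
fingerprint as "not in sync" and fall back to replication instead of throwing:

import java.util.Collections;
import java.util.Map;

public class FingerprintCheckSketch {
  // Stand-in for comparing our index fingerprint with the one a peer returned.
  // If the peer core was already deleted, its response carries no fingerprint.
  static boolean alreadyInSync(Object ourFingerprint, Map<String, Object> peerResponse) {
    Object peerFingerprint = peerResponse.get("fingerprint");
    if (peerFingerprint == null) {
      // Peer is gone (e.g. the source replica removed by REPLACENODE):
      // report "not in sync" rather than dereferencing null.
      return false;
    }
    return peerFingerprint.equals(ourFingerprint);
  }

  public static void main(String[] args) {
    // Simulate the race: the source replica is gone, its response is empty.
    System.out.println(alreadyInSync(new Object(), Collections.emptyMap()));
  }
}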




> REPLACENODE can make the collection with one replica lose data
> --------------------------------------------------------------
>
>                 Key: SOLR-10704
>                 URL: https://issues.apache.org/jira/browse/SOLR-10704
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.2
>         Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
>            Reporter: Daisy.Yuan
>         Attachments: 219.log
>
>
> When a collection has replicationFactor=1, executing the REPLACENODE command 
> can lose data.
> The likely cause is that the new replica on the target node has not finished 
> recovering when the old replica on the source node is already deleted.
> The target replica's recovery then fails with the following exception:
> 2017-05-18 17:08:48,587 | ERROR | 
> recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
> x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 
> r:core_node3 | Error while trying to recover. 
> core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
>         at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
