Nilotpal Nandi created HDDS-3317:
------------------------------------
Summary: replication manager failing to re-replicate under-replicated containers
Key: HDDS-3317
URL: https://issues.apache.org/jira/browse/HDDS-3317
Project: Hadoop Distributed Data Store
Issue Type: Bug
Reporter: Nilotpal Nandi
Properties set:
------------------
"ozone.scm.stale.node.interval": "2m",
"ozone.scm.dead.node.interval": "4m",
"hdds.scm.replication.thread.interval": "12s",
"ozone.scm.container.size": "1GB"
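As a rough sanity check on timing, the sketch below adds the configured dead-node interval to one replication-manager cycle to estimate the latest point after shutdown at which re-replication would be expected to begin. It assumes (not confirmed by this report) that the replication manager only schedules new replicas after the stopped datanodes are marked dead, and it ignores heartbeat and command-dispatch latency. The class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper: estimates when re-replication should start at the latest,
// ASSUMING the replication manager acts only after the nodes are marked DEAD.
public class ReplicationTimelineEstimate {

    static Duration expectedLatestStart(Duration deadNodeInterval,
                                        Duration replicationInterval) {
        // Nodes become DEAD after the dead-node interval; the next replication
        // manager run can then happen up to one full thread interval later.
        return deadNodeInterval.plus(replicationInterval);
    }

    public static void main(String[] args) {
        Duration deadInterval = Duration.ofMinutes(4);          // ozone.scm.dead.node.interval
        Duration replicationInterval = Duration.ofSeconds(12);  // hdds.scm.replication.thread.interval
        Duration observedWait = Duration.ofMinutes(25);         // wait reported in this issue

        Duration expected = expectedLatestStart(deadInterval, replicationInterval);
        // Prints 252 (240 s dead interval + 12 s replication interval)
        System.out.println("Expected re-replication by: " + expected.getSeconds() + "s");
        // Prints true: the observed wait is far beyond the estimate
        System.out.println("Observed wait is longer: " + (observedWait.compareTo(expected) > 0));
    }
}
```

Under these assumptions, re-replication would be expected within roughly 4 minutes 12 seconds, well short of the 25+ minutes observed.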
Steps taken:
-----------------
1) Write a key smaller than one block.
2) Shut down two of the datanodes holding replicas of the container.
3) Query the container info for that container.
After some time, the query output no longer listed the datanodes that had been shut down.
However, the container, which is now under-replicated, was never re-replicated, even
after waiting more than 25 minutes.
{noformat}
ozone scmcli container info 34 | egrep 'Container|Datanodes'
Wed Apr 1 11:20:09 UTC 2020
Container id: 34
Container State: CLOSED
Container Path: /hadoop-ozone/datanode/data/hdds/48873cab-51ef-44f9-8995-86a377038c78/current/containerDir0/34/metadata
Container Metadata:
Datanodes: [quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site]
{noformat}
It appears that the pipeline close command is failing.
SCM log snippet:
------------------
{noformat}
2020-04-01 11:02:40,757 ERROR org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Could not execute pipeline action=action: CLOSE
closePipeline {
  pipelineID {
    id: "7eea320c-7d56-45f4-bc39-a3f492b0c74e"
  }
  reason: PIPELINE_FAILED
  detailedReason: "ea2322d9-8ede-4f48-a72d-693e809d2b95 has not seen follower/s 92f73ec3-9ed8-41c8-9103-c4c1b2b365e1 for 304999ms"
}
pipeline=PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e {}
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.getPipeline(PipelineStateManager.java:63)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.getPipeline(SCMPipelineManager.java:243)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:35)
    at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-01 11:02:40,758 ERROR org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Could not execute pipeline action=action: CLOSE
closePipeline {
  pipelineID {
    id: "7eea320c-7d56-45f4-bc39-a3f492b0c74e"
  }
  reason: PIPELINE_FAILED
  detailedReason: "ea2322d9-8ede-4f48-a72d-693e809d2b95 has not seen follower/s 92f73ec3-9ed8-41c8-9103-c4c1b2b365e1 for 305000ms"
}
pipeline=PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e {}
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.getPipeline(PipelineStateManager.java:63)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.getPipeline(SCMPipelineManager.java:243)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:35)
    at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-01 11:02:40,758 ERROR org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Could not execute pipeline action=action: CLOSE
closePipeline {
  pipelineID {
    id: "7eea320c-7d56-45f4-bc39-a3f492b0c74e"
  }
  reason: PIPELINE_FAILED
  detailedReason: "ea2322d9-8ede-4f48-a72d-693e809d2b95 has not seen follower/s 92f73ec3-9ed8-41c8-9103-c4c1b2b365e1 for 307501ms"
}
pipeline=PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e {}
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=7eea320c-7d56-45f4-bc39-a3f492b0c74e not found
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:133)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager.getPipeline(PipelineStateManager.java:63)
    at org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager.getPipeline(SCMPipelineManager.java:243)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:59)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler.onMessage(PipelineActionHandler.java:35)
    at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)