[
https://issues.apache.org/jira/browse/HDDS-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Gui updated HDDS-6445:
---------------------------
Description:
I hit a problem while doing the following test:
17 DNs, ockg -p test -n 10 -s $((4*1024*1024*1024)) -t 10, then shut down 3 DNs
one by one.
client trace:
{code:java}
java.io.IOException: Allocated 0 blocks. Requested 1 blocks at
org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:175)
at
org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
at
org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
at
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:145)
at com.codahale.metrics.Timer.time(Timer.java:101) at
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:142)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) {code}
SCM trace:
{code:java}
2022-03-11 09:16:33,562 [IPC Server handler 74 on default port 9863] ERROR
org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider: Unable to
allocate a container for EC/ECReplicationConfig{data=10, parity=4,
ecChunkSize=1048576, codec=rs} after trying all existing
containersorg.apache.hadoop.hdds.scm.exceptions.SCMException: No enough
datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2
at
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodes(SCMContainerPlacementRackScatter.java:105)
at
org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:74)
at
org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:40)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:90)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:180)
at
org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.allocateContainer(WritableECContainerProvider.java:168)
at
org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:151)
at
org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:51)
at
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:59)
at
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:176)
at
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
at
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14202)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
at
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966) {code}
The problem is this line:
{code:java}
containersorg.apache.hadoop.hdds.scm.exceptions.SCMException: No enough
datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2 {code}
Actually I only shut down 3 out of 17 DNs, so there should be 14 DNs left, which
is enough for EC 10+4.
Here we see 15 DNs reported (the check ran right after a kill operation, so SCM
had not yet received stale events for the last killed DN) and 2 excluded DNs, so
intuitively there are not enough DNs. But remember that we only killed 3 DNs, so
we should still have enough DNs left.
The real problem is that one of the excluded DNs is already marked stale/dead (I
mean the second DN that was killed, not the last one), so it is not counted
among the 15 DNs shown. We therefore cannot simply compute 15 - 2 = 13 < 14 and
throw; the excluded list must first be restricted to nodes that are actually in
the candidate set.
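A minimal sketch of the inaccurate check and a corrected one. This is not the actual Ozone patch; the method name {{hasEnoughNodes}}, the node lists, and the scenario data are illustrative assumptions:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludedNodeCheck {

  /** Returns true if enough candidate datanodes remain after exclusions. */
  static boolean hasEnoughNodes(List<String> healthyNodes,
                                List<String> excludedNodes,
                                int requiredNodes) {
    Set<String> healthy = new HashSet<>(healthyNodes);
    // Count only excluded nodes that are still in the healthy set; a killed
    // DN already marked stale/dead is not in healthyNodes, so subtracting it
    // again would undercount the DNs that are really available.
    long effectiveExcluded = excludedNodes.stream()
        .filter(healthy::contains)
        .count();
    return healthyNodes.size() - effectiveExcluded >= requiredNodes;
  }

  public static void main(String[] args) {
    // Scenario from this issue: 17 DNs, dn15..dn17 killed one by one. SCM has
    // processed stale events for dn15 and dn16 but not yet for dn17, so 15
    // nodes still look healthy.
    List<String> healthy = new ArrayList<>();
    for (int i = 1; i <= 14; i++) {
      healthy.add("dn" + i);
    }
    healthy.add("dn17"); // killed last, not yet marked stale

    // The client's excluded list contains dn16 (already stale) and dn17.
    List<String> excluded = Arrays.asList("dn16", "dn17");

    // Naive check: 15 - 2 = 13 < 14, so allocation would throw.
    boolean naive = healthy.size() - excluded.size() >= 14;
    // Corrected check: only dn17 overlaps the healthy set, 15 - 1 = 14 >= 14.
    boolean corrected = hasEnoughNodes(healthy, excluded, 14);
    System.out.println(naive + " " + corrected); // false true
  }
}
{code}
With this intersection, the 14 surviving DNs satisfy EC 10+4 and allocateBlock can proceed.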
> EC: Fix allocateBlock failure due to inaccurate excludedNodes check.
> --------------------------------------------------------------------
>
> Key: HDDS-6445
> URL: https://issues.apache.org/jira/browse/HDDS-6445
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Mark Gui
> Assignee: Mark Gui
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)