[
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922646#comment-16922646
]
Chen Zhang edited comment on HDFS-14811 at 9/5/19 2:37 AM:
-----------------------------------------------------------
From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load:
3 > 2.6666666666666665)
{quote}
We can deduce that 8 xceivers are running in total: the limit is twice the
average load across the 6 DNs, so 2.6666 / 2 * 6 = 8. 1 node has 3 xceivers and
the other 5 nodes have 5 xceivers between them. I think this can easily happen
when 2 clients are writing and 2 clients are reading. Since the write and read
targets are chosen randomly, 2 read clients may read from the same DN while
that DN is also writing a block.
In a normal cluster, the NN would choose another DN, but in this special case
there is no other choice once any DN is overloaded. As described above, we
can't prevent several read clients from concentrating on 1 DN, so we can't
avoid the allocation failure as long as load is considered.
You are right, setting {{dfs.namenode.redundancy.considerLoad}} to false is the
right solution.
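To make the arithmetic above concrete, here is a minimal sketch of the load
check as I understand it. It is not the actual BlockPlacementPolicyDefault
code. The class and method names are hypothetical, the factor of 2.0 is assumed
from the "/2" in the calculation above, and the per-node xceiver counts are
just one distribution that matches "1 node has 3, the other 5 share 5".

```java
// Hypothetical sketch of the "node is too busy" check, not Hadoop source code.
final class ConsiderLoadSketch {
    // Assumed multiplier applied to the cluster-wide average xceiver count.
    static final double LOAD_FACTOR = 2.0;

    // Maximum load a node may carry before it is rejected as a write target.
    static double maxLoad(int totalXceivers, int numDatanodes, double factor) {
        return factor * ((double) totalXceivers / numDatanodes);
    }

    // A node is too busy when its own xceiver count exceeds the limit.
    static boolean isTooBusy(int nodeXceivers, int totalXceivers, int numDatanodes) {
        return nodeXceivers > maxLoad(totalXceivers, numDatanodes, LOAD_FACTOR);
    }

    public static void main(String[] args) {
        // One assumed distribution: 8 xceivers over 6 DNs, one DN holding 3.
        int[] xceivers = {3, 1, 1, 1, 1, 1};
        int total = 0;
        for (int x : xceivers) {
            total += x;
        }
        System.out.println("limit = " + maxLoad(total, xceivers.length, LOAD_FACTOR));
        for (int i = 0; i < xceivers.length; i++) {
            System.out.println("DN" + i + " load " + xceivers[i]
                + " tooBusy=" + isTooBusy(xceivers[i], total, xceivers.length));
        }
    }
}
```

With RS-6-3 the write needs all 6 DNs, so rejecting even one DN in this sketch
is enough to reproduce the "5 of the 6 required nodes" failure.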
> RBF: TestRouterRpc#testErasureCoding is flaky
> ---------------------------------------------
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The reason for the failure:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
> Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3
> > 2.6666666666666665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough
> replicas, still in need of 1 to reach 6 (unavailableStorages=[],
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161))
> - Failed to place enough replicas: expected size is 1 but only 0 storage
> types can be selected (replication=6, selected=[], unavailable=[DISK],
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN
> blockmanagement.BlockPlacementPolicy
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK],
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All
> required storage types are unavailable: unavailableStorages=[DISK],
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default
> port 53140, call Call#1270 Retry#0
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6
> node(s) are excluded in this operation.
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2222)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default
> port 53197, call Call#1268 Retry#0
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
> 192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only
> be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6
> datanode(s) running and 6 node(s) are excluded in this operation.
> {code}
> For more discussion, see:
> [HDFS-14654|https://issues.apache.org/jira/browse/HDFS-14654?focusedCommentId=16920439&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16920439]