[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

Chen Zhang (Jira) Wed, 04 Sep 2019 19:38:20 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922646#comment-16922646
 ]


Chen Zhang edited comment on HDFS-14811 at 9/5/19 2:37 AM:
-----------------------------------------------------------

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6666666666666665)
{quote}
We can deduce that there are 8 xceivers running in total (2.6666 / 2 * 6 = 8): 
the threshold is 2 times the average load, and there are 6 DNs. So 1 node has 
3 xceivers and the other 5 nodes have 5 xceivers in total. I think this can 
easily happen when 2 clients are writing and 2 clients are reading: since the 
write targets and read targets are chosen randomly, it is possible that 2 read 
clients read from the same DN while that DN is also writing a block.
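
The arithmetic above can be sketched in a few lines of Java. This is only an illustration, not the NameNode code: the class and method names are hypothetical, and the factor of 2.0 is an assumption taken from the "/2" in the calculation. The comparison mirrors the one printed in the log (load: 3 > 2.6666666666666665).

```java
public class LoadThresholdSketch {
    // Threshold above which a datanode is treated as too busy:
    // factor * (total xceivers / number of datanodes).
    static double maxLoad(int totalXceivers, int dataNodes, double factor) {
        return factor * ((double) totalXceivers / dataNodes);
    }

    // Same comparison as in the log message: load > threshold.
    static boolean tooBusy(int nodeLoad, double maxLoad) {
        return nodeLoad > maxLoad;
    }

    // Inverts the threshold to recover the cluster-wide xceiver count,
    // i.e. the "2.6666 / 2 * 6" step from the comment.
    static long totalXceivers(double maxLoad, int dataNodes, double factor) {
        return Math.round(maxLoad / factor * dataNodes);
    }

    public static void main(String[] args) {
        double threshold = 2.6666666666666665; // value printed in the log
        int dataNodes = 6;                      // datanodes in the test cluster
        double factor = 2.0;                    // assumed load factor

        long total = totalXceivers(threshold, dataNodes, factor);
        System.out.println("total xceivers: " + total);
        System.out.println("load 3 too busy: "
            + tooBusy(3, maxLoad((int) total, dataNodes, factor)));
        System.out.println("load 2 too busy: "
            + tooBusy(2, maxLoad((int) total, dataNodes, factor)));
    }
}
```

With only 8 xceivers spread over 6 nodes, a single DN serving 3 streams is already over the limit.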

In a normal cluster, the NN will choose another DN, but in this special case 
the write needs all 6 DNs, so there is no other choice once any DN is 
overloaded. As I described above, we can't prevent some read clients from 
concentrating on 1 DN, so we can't avoid the allocation failure if we consider 
load.
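
To make the "no other choice" point concrete, here is a rough Java sketch of the selection problem. It is not the real block placement policy: the class and method names are hypothetical, the factor of 2.0 is assumed, and the per-node split of the 5 remaining xceivers (1 each) is an assumption, since the log only gives the total.

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    // Returns the indices of datanodes that may be chosen as write targets.
    // When considerLoad is true, nodes whose load exceeds
    // factor * averageLoad are skipped; otherwise every node is eligible.
    static List<Integer> eligibleNodes(int[] loads, double factor,
                                       boolean considerLoad) {
        int total = 0;
        for (int load : loads) {
            total += load;
        }
        double maxLoad = factor * total / loads.length;

        List<Integer> eligible = new ArrayList<>();
        for (int i = 0; i < loads.length; i++) {
            if (!considerLoad || loads[i] <= maxLoad) {
                eligible.add(i);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // One DN with 3 xceivers; the other 5 share 5 (assumed 1 each).
        int[] loads = {3, 1, 1, 1, 1, 1};
        int required = 6; // targets needed per the log message

        int withLoad = eligibleNodes(loads, 2.0, true).size();
        int withoutLoad = eligibleNodes(loads, 2.0, false).size();

        System.out.println("considerLoad=true:  " + withLoad
            + " eligible, enough=" + (withLoad >= required));
        System.out.println("considerLoad=false: " + withoutLoad
            + " eligible, enough=" + (withoutLoad >= required));
    }
}
```

Because the number of required targets equals the number of DNs, excluding even one node for load makes the allocation fail, whereas ignoring load leaves all 6 available.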

You are right, setting {{dfs.namenode.redundancy.considerLoad}} to false is the 
right solution.


> RBF: TestRouterRpc#testErasureCoding is flaky
> ---------------------------------------------
>
>                 Key: HDFS-14811
>                 URL: https://issues.apache.org/jira/browse/HDFS-14811
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14811.001.patch
>
>
> The failure reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6666666666666665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2222)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
> 192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only 
> be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 
> datanode(s) running and 6 node(s) are excluded in this operation.
> {code}
> For more discussion, see: 
> [HDFS-14654|https://issues.apache.org/jira/browse/HDFS-14654?focusedCommentId=16920439&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16920439]



