[ https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923661#comment-16923661 ]

Chen Zhang edited comment on HDFS-14811 at 9/6/19 12:44 AM:
------------------------------------------------------------

Thanks [~ayushtkn] for your comments. I think I've found the root cause now.
 # The xceiver count is not accurate.
 ## When BPServiceActor sends a heartbeat to the NN, it uses {{DataNode#getXceiverCount()}} to get the xceiver count.
 ## {{DataNode#getXceiverCount()}} actually uses {{ThreadGroup#activeCount()}} as the xceiver count.
 ## But the threadGroup is not only for DataXceiver threads; the DataXceiverServer and PacketResponder threads are also added to the same threadGroup.
 ## So if a DN sends a heartbeat while it is receiving 1 block, the reported xceiver count is 3. If the DN is idle, the reported xceiver count is 1 (only the DataXceiverServer thread).
 ## Here are some tracing logs from just before sending a heartbeat; I dumped all the thread information in the threadGroup and the xceiverCount from {{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
    Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to localhost/127.0.0.1:56096] INFO  datanode.DataNode (BPServiceActor.java:sendHeartBeat(537)) - sendHeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
    Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
    Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at /127.0.0.1:56218 [Receiving block BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXceiverServer]
    Thread[PacketResponder: BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to localhost/127.0.0.1:56096] INFO  datanode.DataNode (BPServiceActor.java:sendHeartBeat(537)) - sendHeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count in the NN may not be updated yet.
 ## In this test, the {{createFile}} method sets the replication factor to 1 for each file, so every file writes only 1 replica.
 ## So the NN may receive heartbeats like this: 5 DNs have 1 xceiver and 1 DN has 3 xceivers.
 ## If we complete a regular file and immediately start another EC file, the NN may not have received the latest heartbeat yet, so the DN with 3 xceivers will be considered overloaded (the threshold is twice the average load: (5*1+3)/6*2=2.666..., see the arithmetic sketch after this list):
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 2.6666666666666665){code}
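
To make the arithmetic concrete, here is a minimal sketch of the overload check as described above (illustrative only, not the exact {{BlockPlacementPolicyDefault}} code; the 2.0 factor is the default consider-load factor):
{code:java}
// Illustrative sketch of the "node is too busy" check described above.
// Not the exact BlockPlacementPolicyDefault implementation.
public class LoadCheckSketch {
  public static void main(String[] args) {
    // Heartbeat-reported xceiver counts: 5 idle DNs report 1, while the DN that just
    // finished a write still reports 3 (DataXceiverServer + DataXceiver + PacketResponder).
    int[] reportedXceiverCounts = {1, 1, 1, 1, 1, 3};
    double considerLoadFactor = 2.0; // default load factor

    int total = 0;
    for (int c : reportedXceiverCounts) {
      total += c;
    }
    double avgLoad = (double) total / reportedXceiverCounts.length; // (5*1+3)/6 = 1.333...
    double maxLoad = considerLoadFactor * avgLoad;                  // 2.666...

    int candidateLoad = 3;
    if (candidateLoad > maxLoad) {
      System.out.println("Datanode is not chosen since the node is too busy (load: "
          + candidateLoad + " > " + maxLoad + ")");
    }
  }
}
{code}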

As a workaround, we can simply not consider load, or trigger and wait for a heartbeat before creating the new EC file. I think either is OK; what's your opinion?
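
For reference, a rough sketch of the two workarounds in the test (the config key name and {{MiniDFSCluster#triggerHeartbeats()}} are what I believe are available on trunk; please double-check before relying on them):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Hypothetical test-setup fragment showing both workarounds; the surrounding
// TestRouterRpc setup is assumed, so this is only a sketch.
public class WorkaroundSketch {
  static void applyWorkarounds(Configuration conf, MiniDFSCluster cluster) throws Exception {
    // Workaround 1: ignore load in block placement (must be set before the NN starts).
    // Assuming the trunk key name; 2.x branches use dfs.namenode.replication.considerLoad.
    conf.setBoolean("dfs.namenode.redundancy.considerLoad", false);

    // Workaround 2: push fresh xceiver counts to the NN right before creating the EC
    // file, so the DN that just finished the previous write no longer reports a stale load.
    cluster.triggerHeartbeats();
  }
}
{code}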

Furthermore, should we fix the inaccuracy of the xceiver count itself? In the worst case, the reported xceiver count may be double the actual xceiver count (every xceiver processing a WRITE_BLOCK op creates an extra PacketResponder thread).
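
If we do decide to fix it, one possible direction (just an illustration, not a patch) is to keep an explicit counter that only DataXceiver threads update, instead of reading {{ThreadGroup#activeCount()}}:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: count DataXceiver threads explicitly instead of relying on
// ThreadGroup#activeCount(), which also counts the DataXceiverServer and
// PacketResponder threads living in the same thread group.
public class XceiverCounterSketch {
  private final AtomicInteger xceiverCount = new AtomicInteger(0);

  // Hypothetical hooks a DataXceiver would call at the start and end of run().
  public void xceiverStarted() {
    xceiverCount.incrementAndGet();
  }

  public void xceiverFinished() {
    xceiverCount.decrementAndGet();
  }

  // What DataNode#getXceiverCount() would then report in heartbeats.
  public int getXceiverCount() {
    return xceiverCount.get();
  }
}
{code}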



> RBF: TestRouterRpc#testErasureCoding is flaky
> ---------------------------------------------
>
>                 Key: HDFS-14811
>                 URL: https://issues.apache.org/jira/browse/HDFS-14811
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14811.001.patch
>
>
> The failure reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6666666666666665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2222)
>       at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
> 192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only 
> be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 
> datanode(s) running and 6 node(s) are excluded in this operation.
> {code}
> More discussion, see: 
> [HDFS-14654|https://issues.apache.org/jira/browse/HDFS-14654?focusedCommentId=16920439&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16920439]


