[jira] [Comment Edited] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

Chen Zhang (Jira) Sun, 01 Sep 2019 09:28:39 -0700


    [ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920438#comment-16920438 ]


Chen Zhang edited comment on HDFS-14654 at 9/1/19 4:27 PM:
-----------------------------------------------------------

BTW, the test {{testErasureCoding}} happened to fail again on my machine; we 
also hit this failure in the penultimate build. The cause is that one node was 
too busy when a block was being allocated:
{code:java}
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
Node /default-rack/127.0.0.1:53148 [
]
Node /default-rack/127.0.0.1:53161 [
]
Node /default-rack/127.0.0.1:53157 [
  Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 > 
2.6666666666666665).
Node /default-rack/127.0.0.1:53143 [
]
Node /default-rack/127.0.0.1:53165 [
]
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas was 
chosen. Reason: {NODE_TOO_BUSY=1}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - 
Failed to place enough replicas: expected size is 1 but only 0 storage types 
can be selected (replication=6, selected=[], unavailable=[DISK], 
removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
required storage types are unavailable:  unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
port 53140, call Call#1270 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
java.io.IOException: File /testec/testfile2 could only be written to 5 of the 6 
required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 node(s) 
are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2222)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
port 53197, call Call#1268 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only be 
written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) 
running and 6 node(s) are excluded in this operation.
{code}
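The "too busy" line in the log can be reproduced with a small calculation. The sketch below assumes that the placement policy rejects a node when its load exceeds a load factor times the cluster's average load, that the factor is 2.0, and that the average load at the time was 8/6; the factor and the average are assumptions chosen because they match the logged threshold of 2.6666666666666665, and the class and method names are hypothetical.

```java
// Sketch only: not the actual BlockPlacementPolicyDefault code.
// The factor 2.0 and the average load 8/6 are assumptions that match the log.
final class LoadCheckSketch {

    /** Highest load a node may have before it is skipped. */
    static double maxLoad(double loadFactor, double averageLoad) {
        return loadFactor * averageLoad;
    }

    /** Returns true when a node with the given load would be rejected as too busy. */
    static boolean isTooBusy(int nodeLoad, double loadFactor, double averageLoad) {
        double max = maxLoad(loadFactor, averageLoad);
        return max > 0 && nodeLoad > max;
    }

    public static void main(String[] args) {
        double averageLoad = 8.0 / 6.0; // assumed: 8 active transfers spread over 6 DNs
        System.out.println("maxLoad = " + maxLoad(2.0, averageLoad));
        System.out.println("load 3 too busy? " + isTooBusy(3, 2.0, averageLoad));
        System.out.println("load 2 too busy? " + isTooBusy(2, 2.0, averageLoad));
    }
}
```

With so few transfers in flight, the average is small, so a single extra transfer on one DN is enough to push it over the limit.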
When we create an EC file with the 6+3 policy (RS-6-3-1024k), at least 6 
blocks must be allocated successfully, but the test cluster is configured with 
only 6 DNs, so any allocation failure (e.g. another test writing a file at the 
same time and overloading a DN) causes the test to fail. I think we can add 
more DNs to the cluster (e.g. 9 in total) to make allocation failures less 
likely.
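The headroom argument can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each block of the group must land on a distinct DN and that a DN rejected as too busy is simply unavailable for that allocation; the class and method names are hypothetical and are not part of the test or of HDFS.

```java
// Sketch only: models the node-count arithmetic, not the real placement logic.
final class EcHeadroomSketch {

    /** DNs still eligible after removing those rejected as too busy. */
    static int usableNodes(int totalDatanodes, int busyDatanodes) {
        return Math.max(0, totalDatanodes - busyDatanodes);
    }

    /** True when enough eligible DNs remain to place the required blocks. */
    static boolean canAllocate(int totalDatanodes, int busyDatanodes, int requiredBlocks) {
        return usableNodes(totalDatanodes, busyDatanodes) >= requiredBlocks;
    }

    /** Number of DNs that may be busy at once without failing the allocation. */
    static int toleratedBusyNodes(int totalDatanodes, int requiredBlocks) {
        return Math.max(0, totalDatanodes - requiredBlocks);
    }

    public static void main(String[] args) {
        int required = 6; // minimum blocks for the 6+3 policy, as stated in the comment
        for (int total : new int[] {6, 9}) {
            System.out.println(total + " DNs: tolerates "
                + toleratedBusyNodes(total, required) + " busy DN(s); one busy DN ok? "
                + canAllocate(total, 1, required));
        }
    }
}
```

Under these assumptions, 6 DNs leave no slack, while 9 DNs allow up to 3 to be busy at the same time before the allocation fails.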

But it's hard to reproduce this failure locally, so I can't yet verify whether 
it really works. Should we change {{NUM_DNS}} to 9 in this patch, or track it 
in another Jira and hold off committing until we've verified that it works?


> RBF: TestRouterRpc#testNamenodeMetrics is flaky
> -----------------------------------------------
>
>                 Key: HDFS-14654
>                 URL: https://issues.apache.org/jira/browse/HDFS-14654
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Takanobu Asanuma
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, HDFS-14654.005.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
