[
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920438#comment-16920438
]
Chen Zhang edited comment on HDFS-14654 at 9/1/19 4:27 PM:
-----------------------------------------------------------
BTW, the test {{testErasureCoding}} happened to fail again on my machine; we also
hit this failure in the penultimate build. The failure is caused by a node being
too busy when allocating a block:
{code:java}
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
Node /default-rack/127.0.0.1:53148 [
]
Node /default-rack/127.0.0.1:53161 [
]
Node /default-rack/127.0.0.1:53157 [
  Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 > 2.6666666666666665).
Node /default-rack/127.0.0.1:53143 [
]
Node /default-rack/127.0.0.1:53165 [
]
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas was chosen. Reason: {NODE_TOO_BUSY=1}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough replicas, still in need of 1 to reach 6 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=6, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default port 53140, call Call#1270 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
java.io.IOException: File /testec/testfile2 could only be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2222)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default port 53197, call Call#1268 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 node(s) are excluded in this operation.
{code}
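For reference, the threshold 2.6666666666666665 in the log looks like the default
considerLoad check in {{BlockPlacementPolicyDefault}}: if I read the code correctly,
a node is skipped when its active xceiver count exceeds the cluster-wide average
multiplied by the considerLoad factor (2.0 by default, if I recall correctly), and
2.0 * (8 / 6) = 2.666..., so a node serving 3 streams is rejected. A rough sketch of
that check (simplified names, not the actual source):
{code:java}
// Rough illustration of the "node is too busy" check; method and parameter
// names are simplified assumptions, not the real Hadoop code.
public class LoadCheckSketch {
  // Example from the log above, assuming the default factor of 2.0:
  // clusterAvgXceivers = 8.0 / 6 = 1.333..., maxLoad = 2.666...,
  // so a node with 3 active xceivers is excluded.
  static boolean isTooBusy(int nodeXceivers, double clusterAvgXceivers,
                           double considerLoadFactor) {
    double maxLoad = considerLoadFactor * clusterAvgXceivers;
    return maxLoad > 0 && nodeXceivers > maxLoad;
  }
}
{code}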
When we create an EC file with the RS-6-3 policy, at least 6 block allocations must
succeed, but the number of DNs in the test cluster is configured to 6, so any
allocation failure (e.g. another test writing a file at the same time and
overloading some DN) fails the test. I think we can add more DNs to the cluster
(e.g. 9 in total) to reduce the chance of an allocation failure.
But it's hard to reproduce this failure locally, so I can't verify that the fix
really works. Should we change {{NUM_DNS}} to 9 in this patch, or track it in
another Jira and only commit it after we've verified in some way that it works?
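If we do bump the datanode count in this patch, the change itself should be small.
A minimal sketch, assuming the test spins up its cluster via the standard
{{MiniDFSCluster}} builder (the RBF test harness may wire the datanode count
differently, so treat this as illustrative only):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class NumDnsSketch {
  // Hypothetical sketch, not a verified patch: 9 DNs leave room for one
  // busy/excluded node while still offering 6 targets for RS-6-3 block groups.
  private static final int NUM_DNS = 9; // previously 6

  static MiniDFSCluster startCluster() throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(NUM_DNS)
        .build();
    cluster.waitActive();
    return cluster;
  }
}
{code}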
> RBF: TestRouterRpc#testNamenodeMetrics is flaky
> -----------------------------------------------
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Takanobu Asanuma
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch,
> HDFS-14654.003.patch, HDFS-14654.004.patch, HDFS-14654.005.patch, error.log
>
>
> They sometimes pass and sometimes fail.