[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Committed to trunk and branch-3.0, thanks for the review Eddy!

> BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
> --------------------------------------------------------------------------
>
>                 Key: HDFS-12567
>                 URL: https://issues.apache.org/jira/browse/HDFS-12567
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>              Labels: hdfs-ec-3.0-must-do
>             Fix For: 3.0.0
>
>         Attachments: HDFS-12567.001.patch, HDFS-12567.002.patch,
> HDFS-12567.003.patch, HDFS-12567.repro.patch
>
>
> Found this while doing some testing on an internal cluster with an unusual
> setup. We have a rack with ~20 nodes, then a few more with just a few nodes.
> It would fail to get (# data blocks) datanodes even though there were plenty
> of DNs on the rack with 20 DNs.
> I managed to reproduce this same issue in a unit test, stack trace like this:
> {noformat}
> java.io.IOException: File /testfile0 could only be written to 5 of the 6
> required nodes for RS-6-3-1024k. There are 9 datanode(s) running and no
> node(s) are excluded in this operation.
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2083)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2609)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:863)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:548)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {noformat}
> This isn't a very critical bug since it's an unusual rack configuration, but
> it can easily happen during testing.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
    Attachment: HDFS-12567.003.patch

Fix checkstyles.

[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
    Attachment: HDFS-12567.002.patch

Good point, thanks for reviewing Eddy! I had an idea earlier about parameterizing this test to have different # nodes/racks, but found that this single test was able to reproduce the issue.

[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
    Status: Patch Available  (was: Open)

[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
    Attachment: HDFS-12567.001.patch

Patch attached. This basically wraps the current logic with a fallback that removes the maxNodesPerRack limit when we fail to place enough racks. I used the earlier repro patch as the unit test, which now passes.

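The fallback described for HDFS-12567.001.patch can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code from the patch: the class RackPlacementSketch, its methods, and the 7/1/1 rack layout are hypothetical, and the per-rack cap is assumed to be "spread replicas evenly across racks, rounding up". The layout was picked so that the strict pass lands on 5 nodes out of 9 running, below the 6 required in the error message above; the actual topology used by the repro test may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; NOT Hadoop's BlockPlacementPolicyRackFaultTolerant. */
final class RackPlacementSketch {

  /** Assumed cap: spread replicas evenly over the racks, rounding up. */
  static int maxNodesPerRack(int replicas, int racks) {
    return (replicas - 1) / racks + 1;
  }

  /** Picks up to `wanted` nodes, taking at most `cap` nodes from any one rack. */
  static List<String> choose(Map<String, Integer> nodesPerRack, int wanted, int cap) {
    List<String> chosen = new ArrayList<>();
    for (Map.Entry<String, Integer> rack : nodesPerRack.entrySet()) {
      int take = Math.min(rack.getValue(), cap);
      for (int i = 0; i < take && chosen.size() < wanted; i++) {
        chosen.add(rack.getKey() + "/dn" + i);
      }
    }
    return chosen;
  }

  /** Strict pass first; if it comes up short, top up while ignoring the cap. */
  static List<String> chooseWithFallback(Map<String, Integer> nodesPerRack, int wanted) {
    int cap = maxNodesPerRack(wanted, nodesPerRack.size());
    Set<String> chosen = new LinkedHashSet<>(choose(nodesPerRack, wanted, cap));
    // Fallback: nodes already chosen are skipped by the set, the rest fill the gap.
    for (Map.Entry<String, Integer> rack : nodesPerRack.entrySet()) {
      for (int i = 0; i < rack.getValue() && chosen.size() < wanted; i++) {
        chosen.add(rack.getKey() + "/dn" + i);
      }
    }
    return new ArrayList<>(chosen);
  }

  public static void main(String[] args) {
    // Hypothetical skewed cluster: one large rack and two single-node racks.
    Map<String, Integer> topology = new LinkedHashMap<>();
    topology.put("/rack0", 7);
    topology.put("/rack1", 1);
    topology.put("/rack2", 1);

    int wanted = 9; // RS-6-3: 6 data + 3 parity blocks
    int cap = maxNodesPerRack(wanted, topology.size());

    System.out.println("cap per rack: " + cap);
    System.out.println("strict pass: " + choose(topology, wanted, cap).size() + " of " + wanted);
    System.out.println("with fallback: " + chooseWithFallback(topology, wanted).size() + " of " + wanted);
  }
}
```

In this model the strict pass takes 3 nodes from the large rack and 1 from each small rack, leaving the remaining nodes on the large rack unused. The fallback keeps those rack-spread picks and only then fills the gap from whatever nodes are left.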
[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
[ https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12567:
-------------------------------
    Attachment: HDFS-12567.repro.patch

Here's a unit test I wrote that reliably reproduces this issue. Haven't been able to dig into the root cause yet.