[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-10-05 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-3.0. Thanks for the review, Eddy!

> BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes
> --
>
> Key: HDFS-12567
> URL: https://issues.apache.org/jira/browse/HDFS-12567
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>  Labels: hdfs-ec-3.0-must-do
> Fix For: 3.0.0
>
> Attachments: HDFS-12567.001.patch, HDFS-12567.002.patch, 
> HDFS-12567.003.patch, HDFS-12567.repro.patch
>
>
> Found this while doing some testing on an internal cluster with an unusual 
> setup. We have one rack with ~20 nodes and a few more racks with just a few 
> nodes each. Block placement would fail to find (# data blocks) datanodes even 
> though there were plenty of DNs on the 20-node rack.
> I managed to reproduce the same issue in a unit test; the stack trace looks 
> like this:
> {noformat}
> java.io.IOException: File /testfile0 could only be written to 5 of the 6 
> required nodes for RS-6-3-1024k. There are 9 datanode(s) running and no 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2083)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2609)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:863)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:548)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {noformat}
> This isn't a very critical bug since it's an unusual rack configuration, but 
> it can easily happen during testing.
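
To make the scenario above concrete, here is a rough editorial illustration. The per-rack cap formula below (ceil(needed / racks)) is a simplifying assumption for illustration only, not the exact computation done by BlockPlacementPolicyRackFaultTolerant: RS-6-3-1024k needs 6 data + 3 parity = 9 distinct datanodes per block group, and the policy limits how many of those may come from any single rack. With one large rack and a couple of one-node racks, that limit can leave fewer eligible targets than even the 6 data blocks require, despite plenty of free DNs on the large rack.

{code:java}
// Editorial illustration only -- the per-rack cap here is a simplifying
// assumption, not the exact formula used by the placement policy.
public class SkewedRackIllustration {
  public static void main(String[] args) {
    int needed = 6 + 3;                      // RS-6-3: 6 data + 3 parity targets
    int[] rackSizes = {20, 1, 1};            // one big rack, two tiny racks

    // Hypothetical cap: spread targets as evenly as possible across racks.
    int cap = (needed + rackSizes.length - 1) / rackSizes.length;  // ceil(9/3) = 3

    int placeable = 0;
    for (int size : rackSizes) {
      placeable += Math.min(size, cap);      // 3 + 1 + 1 = 5
    }
    // Prints "placeable=5 of needed=9": fewer than even the 6 data blocks,
    // although 22 datanodes are available overall.
    System.out.println("placeable=" + placeable + " of needed=" + needed);
  }
}
{code}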






[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-10-05 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
Attachment: HDFS-12567.003.patch

Fix checkstyle issues.







[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-10-04 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
Attachment: HDFS-12567.002.patch

Good point, thanks for reviewing, Eddy! I had an idea earlier about 
parameterizing this test with different numbers of nodes and racks, but found 
that this single test was able to reproduce the issue.







[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-10-04 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
Status: Patch Available  (was: Open)







[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-10-04 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
Attachment: HDFS-12567.001.patch

Patch attached.

This basically wraps the current logic with a fallback that removes the 
maxNodesPerRack limit when we fail to place enough racks. I used the earlier 
repro patch as the unit test, which now passes. 
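
For anyone skimming the patch, the fallback shape described above is roughly the following. This is a minimal editorial sketch; the class, method, and exception names are hypothetical and are not the actual Hadoop APIs touched by the patch.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "try with the per-rack cap, fall back without it".
public class FallbackPlacementSketch {

  static class NotEnoughTargetsException extends Exception {
  }

  // Pick up to 'needed' nodes, taking at most 'maxPerRack' from any single rack.
  static List<String> chooseWithCap(Map<String, List<String>> racks,
      int needed, int maxPerRack) throws NotEnoughTargetsException {
    List<String> chosen = new ArrayList<>();
    for (List<String> nodes : racks.values()) {
      int fromThisRack = Math.min(maxPerRack, nodes.size());
      for (int i = 0; i < fromThisRack && chosen.size() < needed; i++) {
        chosen.add(nodes.get(i));
      }
    }
    if (chosen.size() < needed) {
      throw new NotEnoughTargetsException();
    }
    return chosen;
  }

  static List<String> chooseTargets(Map<String, List<String>> racks,
      int needed, int maxPerRack) throws NotEnoughTargetsException {
    try {
      // Common case: spread targets across racks under the maxNodesPerRack limit.
      return chooseWithCap(racks, needed, maxPerRack);
    } catch (NotEnoughTargetsException e) {
      // Fallback: lift the limit so a skewed topology can still satisfy the write.
      return chooseWithCap(racks, needed, Integer.MAX_VALUE);
    }
  }
}
{code}

The point of wrapping, rather than replacing, the existing logic is that well-balanced topologies keep the same rack-fault-tolerant spread; only the failure path is relaxed.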







[jira] [Updated] (HDFS-12567) BlockPlacementPolicyRackFaultTolerant fails with racks with very few nodes

2017-09-29 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-12567:
---
Attachment: HDFS-12567.repro.patch

Here's a unit test I wrote that reliably reproduces this issue. I haven't been 
able to dig into the root cause yet.
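
For context, a repro for this kind of skewed topology generally looks something like the sketch below. This is an editorial guess, not the attached HDFS-12567.repro.patch: the exact rack layout, file size, and setup calls in the real test may differ.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class TestSkewedRackEcPlacementRepro {

  @Test
  public void testWriteEcFileWithSkewedRacks() throws Exception {
    Configuration conf = new Configuration();
    // 9 datanodes: 6 on one rack, 1 each on three small racks (layout is a guess).
    String[] racks = {
        "/rack0", "/rack0", "/rack0", "/rack0", "/rack0", "/rack0",
        "/rack1", "/rack2", "/rack3"
    };
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(racks.length)
        .racks(racks)
        .build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      fs.enableErasureCodingPolicy("RS-6-3-1024k");
      fs.setErasureCodingPolicy(new Path("/"), "RS-6-3-1024k");
      // Before the fix, this write fails with "could only be written to 5 of
      // the 6 required nodes for RS-6-3-1024k".
      DFSTestUtil.createFile(fs, new Path("/testfile0"), 1024 * 1024, (short) 1, 0L);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}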



