[ 
https://issues.apache.org/jira/browse/HDFS-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596881#comment-16596881
 ] 

Eric Badger commented on HDFS-10755:
------------------------------------

[~kennychang] were you actually able to reproduce the error when the patch is 
applied? This patch is from a few years ago so I don't remember the analysis. 
But it looks like it goes out and set the port in the conf to grab an ephemeral 
port. So I'm not sure why that would fail with a port bind issue.

> TestDecommissioningStatus BindException Failure
> -----------------------------------------------
>
>                 Key: HDFS-10755
>                 URL: https://issues.apache.org/jira/browse/HDFS-10755
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>         Attachments: HDFS-10755.001.patch, HDFS-10755.002.patch
>
>
> Tests in TestDecomissioningStatus call MiniDFSCluster.dataNodeRestart(). They 
> are required to come back up on the same (initially ephemeral) port that they 
> were on before being shutdown. Because of this, there is an inherent race 
> condition where another process could bind to the port while the datanode is 
> down. If this happens then we get a BindException failure. However, all of 
> the tests in TestDecommissioningStatus depend on the cluster being up and 
> running for them to run correctly. So if a test blows up the cluster, the 
> subsequent tests will also fail. Below I show the BindException failure as 
> well as the subsequent test failure that occurred.
> {noformat}
> java.net.BindException: Problem binding to [localhost:35370] 
> java.net.BindException: Address already in use; For more details see:  
> http://wiki.apache.org/hadoop/BindException
>       at sun.nio.ch.Net.bind0(Native Method)
>       at sun.nio.ch.Net.bind(Net.java:436)
>       at sun.nio.ch.Net.bind(Net.java:428)
>       at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>       at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>       at org.apache.hadoop.ipc.Server.bind(Server.java:430)
>       at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:768)
>       at org.apache.hadoop.ipc.Server.<init>(Server.java:2391)
>       at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:951)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:523)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:498)
>       at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:796)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initIpcServer(DataNode.java:802)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1134)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:429)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2387)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2274)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2321)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartDataNode(MiniDFSCluster.java:2037)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionDeadDN(TestDecommissioningStatus.java:426)
> {noformat}
> {noformat}
> java.lang.AssertionError: Number of Datanodes  expected:<2> but was:<1>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:275)
> {noformat}
> I don't think there's any way to avoid the inherent race condition with 
> getting the same ephemeral port, but we can definitely fix the tests so that 
> it doesn't cause subsequent tests to fail. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to