[ 
https://issues.apache.org/jira/browse/HDFS-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated HDFS-10755:
-------------------------------
    Attachment: HDFS-10755.001.patch

Attaching a patch that catches the BindException during the datanode restart 
and restarts the datanode on a new port before rethrowing the original bind 
exception. This way, we will not affect subsequent tests that rely on the 
cluster being in a usable state. 
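The patch itself isn't reproduced here, but the pattern can be sketched with plain JDK sockets rather than the MiniDFSCluster API (the class and method names below are my own, purely for illustration):

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Illustrative sketch only: catch the bind failure on the preferred port
 *  and fall back to an ephemeral port so later work still has a live socket. */
public class BindFallbackDemo {

    /** Try the preferred port; on BindException, rebind on port 0 so the OS
     *  hands out a fresh ephemeral port. Caller is responsible for closing. */
    public static ServerSocket bindWithFallback(int preferredPort) throws IOException {
        try {
            return new ServerSocket(preferredPort, 50, InetAddress.getLoopbackAddress());
        } catch (BindException e) {
            // Another process won the race for the port; take any free port
            // instead of leaving the "cluster" down for everything that follows.
            return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        }
    }

    public static void main(String[] args) throws IOException {
        // Occupy a port first to simulate "someone else grabbed it while we were down".
        try (ServerSocket taken = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             ServerSocket fallback = bindWithFallback(taken.getLocalPort())) {
            System.out.println("wanted " + taken.getLocalPort()
                    + ", got " + fallback.getLocalPort());
        }
    }
}
```

In the real patch the fallback restart happens before the original BindException is rethrown, so the test still fails but the cluster stays usable for the tests after it.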

Additionally, MiniDFSCluster.restartDataNode() seems to do something odd with 
the keepPort flag that I can't quite figure out. If you call restartDataNode() 
with keepPort=true, it fails with a BindException if something else is on the 
port, as expected. But if you then catch the exception and call 
restartDataNode() with keepPort=false, it fails with a BindException on the 
same port. This leads me to believe that the keepPort if statement doesn't 
actually do anything and that the same port is reused regardless of the flag, 
because the address is saved in the conf. However, if I remove the if 
statement, then running restartDataNode() with keepPort=true gets a new 
ephemeral port whenever the previous port is in use. 

I'm at a loss as to why that happens, so in the patch I explicitly set the 
ports when keepPort=true and explicitly reset them when keepPort=false. If 
anyone else could shed some light on this, I would appreciate it. 
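The explicit set/reset described above can be sketched with a plain Map standing in for Hadoop's Configuration object. The two config keys are the real DataNode address keys, but the helper method and its signature are hypothetical, not taken from the patch:

```java
import java.util.HashMap;
import java.util.Map;

/** Hedged sketch of the keepPort conf handling, using a Map in place of
 *  org.apache.hadoop.conf.Configuration. Helper name is illustrative only. */
public class KeepPortSketch {
    // Real Hadoop config keys for the DataNode's data-transfer and IPC addresses.
    static final String ADDR_KEY = "dfs.datanode.address";
    static final String IPC_KEY  = "dfs.datanode.ipc.address";

    /** keepPort=true: pin the old ports explicitly so the restarted datanode
     *  must come back on them. keepPort=false: reset to port 0 so the OS
     *  assigns fresh ephemeral ports instead of reusing a stale conf value. */
    static void prepareRestartConf(Map<String, String> conf, boolean keepPort,
                                   int oldPort, int oldIpcPort) {
        if (keepPort) {
            conf.put(ADDR_KEY, "127.0.0.1:" + oldPort);
            conf.put(IPC_KEY,  "127.0.0.1:" + oldIpcPort);
        } else {
            conf.put(ADDR_KEY, "127.0.0.1:0");
            conf.put(IPC_KEY,  "127.0.0.1:0");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        prepareRestartConf(conf, true, 35370, 35371);
        System.out.println("keepPort=true:  " + conf.get(ADDR_KEY));
        prepareRestartConf(conf, false, 35370, 35371);
        System.out.println("keepPort=false: " + conf.get(ADDR_KEY));
    }
}
```

The point of the explicit reset is that a port saved in the conf from the previous start would otherwise silently win over whatever the keepPort flag implies.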

> TestDecommissioningStatus BindException Failure
> -----------------------------------------------
>
>                 Key: HDFS-10755
>                 URL: https://issues.apache.org/jira/browse/HDFS-10755
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: HDFS-10755.001.patch
>
>
> Tests in TestDecommissioningStatus call MiniDFSCluster.restartDataNode(). The 
> datanodes are required to come back up on the same (initially ephemeral) port 
> that they were on before being shut down. Because of this, there is an inherent race 
> condition where another process could bind to the port while the datanode is 
> down. If this happens then we get a BindException failure. However, all of 
> the tests in TestDecommissioningStatus depend on the cluster being up and 
> running for them to run correctly. So if a test blows up the cluster, the 
> subsequent tests will also fail. Below I show the BindException failure as 
> well as the subsequent test failure that occurred.
> {noformat}
> java.net.BindException: Problem binding to [localhost:35370] 
> java.net.BindException: Address already in use; For more details see:  
> http://wiki.apache.org/hadoop/BindException
>       at sun.nio.ch.Net.bind0(Native Method)
>       at sun.nio.ch.Net.bind(Net.java:436)
>       at sun.nio.ch.Net.bind(Net.java:428)
>       at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>       at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>       at org.apache.hadoop.ipc.Server.bind(Server.java:430)
>       at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:768)
>       at org.apache.hadoop.ipc.Server.<init>(Server.java:2391)
>       at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:951)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:523)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:498)
>       at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:796)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initIpcServer(DataNode.java:802)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1134)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:429)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2387)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2274)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2321)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartDataNode(MiniDFSCluster.java:2037)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionDeadDN(TestDecommissioningStatus.java:426)
> {noformat}
> {noformat}
> java.lang.AssertionError: Number of Datanodes  expected:<2> but was:<1>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:275)
> {noformat}
> I don't think there's any way to avoid the inherent race condition with 
> getting the same ephemeral port, but we can definitely fix the tests so that 
> one failure doesn't cause subsequent tests to fail as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
