[jira] [Commented] (HDFS-3364) TestFileAppend4.testRecoverFinalizedBlock occasionally times out

Junping Du (JIRA) Wed, 04 Jul 2012 01:14:41 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406354#comment-13406354
 ]


Junping Du commented on HDFS-3364:
----------------------------------

I saw this error in the pre-commit test run for HADOOP-8472, and I also cannot 
reproduce it locally. The full log is at 
https://builds.apache.org/job/PreCommit-HADOOP-Build/1155//testReport/org.apache.hadoop.hdfs/TestFileAppend4/testCompleteOtherLeaseHoldersFile/
 and the relevant part is as follows:
{noformat}
2012-06-29 16:26:22,187 INFO  ipc.Server (Server.java:stop(1991)) - Stopping server on 46047
2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-851929818-67.195.138.20-36687-1340987179189) service to localhost/127.0.0.1:49801
2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-1404243958-67.195.138.20-36424-1340987179356) service to localhost/127.0.0.1:49801
2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-1672148922-67.195.138.20-53642-1340987179269) service to localhost/127.0.0.1:49801
2012-06-29 16:26:22,425 INFO  ipc.Server (Server.java:run(638)) - Stopping IPC Server listener on 46047
2012-06-29 16:26:22,425 INFO  ipc.Server (Server.java:run(780)) - Stopping IPC Server Responder
2012-06-29 16:26:22,425 INFO  datanode.DataNode (DataNode.java:shutdown(1068)) - Waiting for threadgroup to exit, active threads is 1
2012-06-29 16:26:22,428 WARN  ipc.Server (Server.java:processResponse(979)) - IPC Server Responder, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 127.0.0.1:51257: output error
2012-06-29 16:26:22,429 INFO  ipc.Server (Server.java:run(1745)) - IPC Server handler 9 on 49801 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2131)
        at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
{noformat}

From this log, the problem happens while shutting down the 3rd datanode, since 
the previous 2 datanodes were shut down successfully. It looks like the IPC 
server on the 3rd datanode was being shut down just as its BPServiceActor was 
sending a heartbeat in offerService(), so some heartbeat exceptions were thrown.
Also, the following code, which is meant to let offerService() be interrupted, 
does not seem to take effect. That suggests offerService() is blocked somewhere 
else, for example waiting for the heartbeat response or on 
pendingIncrementalBR. Would catching InterruptedException around the whole of 
offerService() help here?

{code}
  synchronized(pendingIncrementalBR) {
    if (waitTime > 0 && pendingReceivedRequests == 0) {
      try {
        pendingIncrementalBR.wait(waitTime);
      } catch (InterruptedException ie) {
        LOG.warn("BPOfferService for " + this + " interrupted");
      }
    }
  }
{code}
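
To make the question above concrete, here is a minimal, self-contained sketch 
of the alternative handling: instead of only logging the InterruptedException, 
the loop restores the interrupt flag and returns, so a later Thread.join() on 
the actor thread can complete. This is not the actual BPServiceActor code. The 
class InterruptibleWaitSketch and the names offerServiceLoop, pendingWork, 
shouldRun and stopsWithin are hypothetical stand-ins, and the sketch does not 
cover the case where the thread is blocked inside an RPC call rather than in 
wait().

```java
// Hypothetical sketch, not Hadoop code: a service loop that exits on interrupt.
class InterruptibleWaitSketch {
    // Stand-in for pendingIncrementalBR: the monitor the loop waits on.
    private final Object pendingWork = new Object();
    // Stand-in for the actor's "keep running" flag.
    private volatile boolean shouldRun = true;

    // Stand-in for offerService(): waits between rounds of work.
    void offerServiceLoop(long waitTimeMs) {
        while (shouldRun) {
            synchronized (pendingWork) {
                try {
                    pendingWork.wait(waitTimeMs);
                } catch (InterruptedException ie) {
                    // wait() cleared the interrupt status when it threw.
                    // Restore it and leave the loop instead of only logging.
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    // Starts the loop, requests a stop, and reports whether the worker
    // thread exited within timeoutMs.
    static boolean stopsWithin(long timeoutMs) {
        InterruptibleWaitSketch sketch = new InterruptibleWaitSketch();
        Thread worker =
            new Thread(() -> sketch.offerServiceLoop(10_000L), "sketch-actor");
        worker.start();
        sketch.shouldRun = false;
        worker.interrupt();
        try {
            worker.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped in time: " + stopsWithin(2_000L));
    }
}
```

The design point is that both the flag and the interrupt are used together: 
the flag ends the loop if the thread is between waits, and the interrupt wakes 
it if it is already inside wait().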
                
> TestFileAppend4.testRecoverFinalizedBlock occasionally times out
> ----------------------------------------------------------------
>
>                 Key: HDFS-3364
>                 URL: https://issues.apache.org/jira/browse/HDFS-3364
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>
> I've seen TestFileAppend4.testRecoverFinalizedBlock shutdown occasionally 
> time out in jenkins. Doesn't fail for me locally.
> {noformat}
> test timed out after 60000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 60000 milliseconds
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Thread.join(Thread.java:1186)
>       at java.lang.Thread.join(Thread.java:1239)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.join(BPServiceActor.java:473)
>       at org.apache.hadoop.hdfs.server.datanode.BPOfferService.join(BPOfferService.java:259)
>       at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.shutDownAll(BlockPoolManager.java:117)
>       at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1098)
>       at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:1280)
>       at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1260)
>       at org.apache.hadoop.hdfs.TestFileAppend4.testRecoverFinalizedBlock(TestFileAppend4.java:208)
> {noformat}
