[ 
https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649087#action_12649087
 ] 

Steve Loughran commented on HADOOP-4659:
----------------------------------------

I'm going to put up a merged patch. Although the RPC test is passing, the 
spinning appears to be causing a deadlock in TestFileCreationClient; the 
relevant bits of the thread dump follow.

1. We're sleeping here holding [EMAIL PROTECTED]

    [junit] "DataStreamer for file /wrwelkj/file9 block 
blk_-4298389317957709021_1010" id=133 idx=0x210 tid=25976 prio=5 alive, in 
native, sleeping, native_waiting, daemon
    [junit]     at java/lang/Thread.sleep(J)V(Native Method)
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:373)
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
    [junit]     ^-- Holding lock: org/apache/hadoop/ipc/[EMAIL PROTECTED] lock]
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.access$1700(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:791)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown 
Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)

2. Which is blocking this
    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/[EMAIL 
PROTECTED] lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at 
jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at 
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown 
Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)

and this

    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/[EMAIL 
PROTECTED] lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at 
jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at 
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown 
Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
    [junit]     ^-- Holding lock: java/util/[EMAIL PROTECTED] lock]
    [junit]     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
    [junit]     -- end of trace

and this

    [junit] "DataStreamer for file /wrwelkj/file5 block 
blk_7479178383257153500_1010" id=127 idx=0x200 tid=25971 prio=5 alive, in 
native, blocked, daemon
    [junit]     -- Blocked trying to get lock: org/apache/hadoop/ipc/[EMAIL 
PROTECTED] lock]
    [junit]     at jrockit/vm/Threads.sleep(I)V(Native Method)
    [junit]     at 
jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
    [junit]     at 
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
    [junit]     at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
    [junit]     at 
org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
    [junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
    [junit]     at org/apache/hadoop/ipc/Client.call(Client.java:697)
    [junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
    [junit]     at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown 
Source)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
    [junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
    [junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
    [junit]     at 
org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
    [junit]     ^-- Holding lock: java/util/[EMAIL PROTECTED] lock]
    [junit]     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

So the sleep in setupIOstreams appears to be blocking the other operations: it 
holds the lock that the threads in addCall are waiting for. For some reason, 
<junit> isn't timing out or killing the process, which implies this is fairly 
serious.
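
To illustrate the pattern in the dump, here is a minimal sketch of what happens 
when one thread sleeps inside a synchronized method while another thread calls 
a second synchronized method on the same object. The Connection class below is 
a hypothetical stand-in, not the real org.apache.hadoop.ipc.Client code; the 
method names setupWithRetrySleep and measureBlockedMillis are made up for the 
example, and the lambda syntax assumes Java 8 or later.

```java
import java.util.concurrent.CountDownLatch;

public class SleepUnderLockDemo {
    // Hypothetical stand-in for a connection object; NOT the Hadoop class.
    static class Connection {
        private final CountDownLatch lockHeld = new CountDownLatch(1);

        // Backs off between retries while still holding this object's monitor.
        synchronized void setupWithRetrySleep(long sleepMillis) throws InterruptedException {
            lockHeld.countDown();      // signal: the monitor is now held
            Thread.sleep(sleepMillis); // Thread.sleep does not release the monitor
        }

        // Any other synchronized method on the same object must wait.
        synchronized void addCall() {
        }

        // Deliberately not synchronized, so callers can wait without the monitor.
        void awaitLockHeld() throws InterruptedException {
            lockHeld.await();
        }
    }

    /** Returns roughly how long addCall() was blocked, in milliseconds. */
    static long measureBlockedMillis(long sleepMillis) {
        Connection conn = new Connection();
        Thread sleeper = new Thread(() -> {
            try {
                conn.setupWithRetrySleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sleeper.setDaemon(true);
        sleeper.start();
        try {
            conn.awaitLockHeld();          // sleeper is inside the monitor now
            long start = System.nanoTime();
            conn.addCall();                // blocks until the sleep finishes
            long blocked = (System.nanoTime() - start) / 1_000_000;
            sleeper.join();
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("addCall() blocked for ~" + measureBlockedMillis(500) + " ms");
    }
}
```

The caller of addCall() waits for about as long as the sleep lasts. With a long 
retry loop, every thread sharing that object stalls for the whole loop, which 
matches the blocked DataStreamer threads above. One common way around this, 
offered only as a suggestion, is to do the back-off outside the synchronized 
region.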

> Root cause of connection failure is being lost to code that uses it for 
> delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch, rpcConn.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the 
> exception is wrapped, so the outside code that looks for that root cause 
> isn't working as expected. The result is that you can't bring up a task 
> tracker before the job tracker, and probably the same for a datanode before 
> a namenode. The change that triggered this has not yet been located; I had 
> thought it was HADOOP-3844, but I no longer believe this is the case.
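
The issue description above can be sketched as follows. This is an illustrative 
example only, not the ipc.Client code or any of the attached patches: the 
helper names wrapWithCause, rootCause and isConnectionRefused are hypothetical. 
It shows that a wrapper which keeps the original exception as its cause lets 
outside code still find a ConnectException, while a wrapper that only copies 
the message does not.

```java
import java.io.IOException;
import java.net.ConnectException;

public class RootCauseDemo {
    // Hypothetical helper: add context to the message but keep the original as the cause.
    static IOException wrapWithCause(String context, IOException original) {
        IOException wrapped = new IOException(context + ": " + original.getMessage());
        wrapped.initCause(original);
        return wrapped;
    }

    // Walk the cause chain to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // What "outside code that looks for the root cause" might do.
    static boolean isConnectionRefused(Throwable t) {
        return rootCause(t) instanceof ConnectException;
    }

    public static void main(String[] args) {
        IOException original = new ConnectException("Connection refused");

        // Cause preserved: the check still recognises the connection failure.
        System.out.println(isConnectionRefused(wrapWithCause("Call failed", original)));

        // Cause dropped (message copied only): the check no longer matches.
        System.out.println(isConnectionRefused(new IOException("Call failed: " + original.getMessage())));
    }
}
```

Code that retries on a refused connection, such as a service waiting for 
another service to start, depends on the first form.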

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
