[ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650567#action_12650567 ]
Steve Loughran commented on HADOOP-4659:
----------------------------------------

I'm going to push out my updated lifecycle patches shortly. One test I have there brings up a tasktracker without the rest of the infrastructure (DFS, jobtracker); it now hangs until the test times out, spinning while things get set up, waiting for a job tracker that never arrives.

[junit] Tue Nov 25 13:50:13 2008
[junit] BEA JRockit(R) R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64
[junit] "Main Thread" id=1 idx=0x4 tid=4074 prio=5 alive, in native, sleeping, native_waiting
[junit]     at java/lang/Thread.sleep(J)V(Native Method)
[junit]     at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:364)
[junit]     at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
[junit]     ^-- Holding lock: org/apache/hadoop/ipc/[EMAIL PROTECTED] lock]
[junit]     at org/apache/hadoop/ipc/Client$Connection.access$1800(Client.java:177)
[junit]     at org/apache/hadoop/ipc/Client.getConnection(Client.java:792)
[junit]     at org/apache/hadoop/ipc/Client.call(Client.java:688)
[junit]     at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:215)
[junit]     at org/apache/hadoop/mapred/$Proxy0.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
[junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:347)
[junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:334)
[junit]     at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:371)
[junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:308)
[junit]     at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:285)
[junit]     at org/apache/hadoop/mapred/TaskTracker.initialize(TaskTracker.java:454)
[junit]     ^-- Holding lock: org/apache/hadoop/mapred/[EMAIL PROTECTED]
[junit]     at org/apache/hadoop/mapred/TaskTracker.innerStart(TaskTracker.java:830)
[junit]     ^-- Holding lock: org/apache/hadoop/mapred/[EMAIL PROTECTED] lock]
[junit]     at org/apache/hadoop/util/Service.start(Service.java:186)
[junit]     at org/apache/hadoop/util/Service.deploy(Service.java:654)
[junit]     at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:965)
[junit]     at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:948)

What I propose here is to give TaskTracker a timeout on its waitForProxy() operation, so that if the TT comes up before the JT there's a bit of leeway, but eventually the TT will conclude that it is an orphan and that it cannot start up.

> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: connectRetry.patch, hadoop-4659.patch, hadoop-4659.patch, rpcConn.patch, rpcConn1.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the
> exception is wrapped, so the outside code that looks for that root cause
> isn't working as expected. The result is that you can't bring up a task
> tracker before the job tracker, and probably the same for a datanode before
> a namenode. The change that triggered this has not yet been located; I had
> thought it was HADOOP-3844 but I no longer believe this is the case.
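
As a rough illustration of the timeout proposed in the comment above, the sketch below retries a connection attempt until a deadline passes and then gives up, keeping the last ConnectException as the cause so that callers can still see why the connection failed. BoundedConnect, Attempt and waitFor are hypothetical names made up for this example; this is not the Hadoop RPC API and not the code in the attached patches.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class BoundedConnect {

    /** One connection attempt, e.g. a call that creates an RPC proxy. */
    interface Attempt<T> {
        T call() throws IOException;
    }

    private BoundedConnect() {}

    /**
     * Retries the attempt while it fails with ConnectException, sleeping
     * between tries, until it succeeds or timeoutMillis has elapsed.
     * On timeout the last ConnectException is attached as the cause.
     */
    static <T> T waitFor(Attempt<T> attempt, long timeoutMillis, long retryIntervalMillis)
            throws IOException, InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            ConnectException lastFailure;
            try {
                return attempt.call();
            } catch (ConnectException e) {
                // Server is not up yet: remember why, then decide whether to retry.
                lastFailure = e;
            }
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                // Give up, but keep the root cause visible to the caller.
                throw new IOException(
                        "Gave up after " + timeoutMillis + " ms waiting for the server", lastFailure);
            }
            // Never sleep past the deadline.
            Thread.sleep(Math.max(1L, Math.min(retryIntervalMillis, remainingMillis)));
        }
    }
}
```

Only ConnectException is treated as "not up yet" here; any other IOException propagates immediately. Which failures should count as retryable is a design choice that the real patch would have to make.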