[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Todd Lipcon (JIRA) Fri, 15 May 2009 14:56:10 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709986#action_12709986 ]


Todd Lipcon commented on HADOOP-5852:
-------------------------------------

From the JT log:

{noformat}
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up 
at: 8021
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker 
webserver: 50030
2009-05-15 14:27:30,521 INFO org.apache.hadoop.mapred.JobTracker: problem 
cleaning system directory: 
hdfs://localhost/var/lib/hadoop/cache/hadoop/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: 
Cannot delete /var/lib/hadoop/cache/hadoop/mapred/system. Name node is in safe 
mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe 
mode will be turned off automatically.

...

2009-05-15 14:27:32,202 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: /default-rack/todd-laptop
2009-05-15 14:27:32,204 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 
on 8021, call heartbeat(org.apache.hadoop.mapred.tasktrackersta...@7461373f, 
true, true, -1) from 127.0.0.1:36984: error: java.io.IOException: 
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobQueueTaskScheduler.assignTasks(JobQueueTaskScheduler.java:85)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1285)

[infinite loop of above NPEs]
{noformat}

From the TT log:

{noformat}
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker 
up at: localhost/127.0.0.1:37148
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
tracker tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,195 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
thread: Map-events fetcher for all reduce tasks on 
tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,208 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
'status' to 'localhost' with reponseId '-1
2009-05-15 14:27:32,220 INFO org.apache.hadoop.mapred.TaskTracker: Resending 
'status' to 'localhost' with reponseId '-1
etc etc
{noformat}

These logs are from 0.18.3, but the code suggests the issue is still present 
in trunk. It happens fairly reliably when I start all of my Hadoop daemons at 
the same time: the TT just needs to send its first heartbeat to the JT while 
the NN is still in safe mode.

The problem lies in the fact that the TaskScheduler's TaskTrackerManager isn't 
set until after the JobTracker constructor returns. The IPC handlers, however, 
are started in the middle of the constructor. Therefore, heartbeats can be 
received when the TaskTrackerManager is null, resulting in the NPE.
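
The ordering problem can be reproduced in miniature. The following is only a sketch with hypothetical stand-in classes (Manager, Scheduler and Tracker are not Hadoop names), assuming the setter runs in a separate start step after the constructor, as described above:

```java
// Hypothetical stand-ins for TaskTrackerManager, TaskScheduler and JobTracker.
interface Manager {
    int trackerCount();
}

class Scheduler {
    private Manager manager; // stays null until setManager() is called

    void setManager(Manager manager) {
        this.manager = manager;
    }

    int assignTasks() {
        // Dereferences the manager, so it throws NullPointerException
        // if a heartbeat arrives before setManager() has run.
        return manager.trackerCount();
    }
}

class Tracker implements Manager {
    final Scheduler scheduler;

    Tracker() {
        scheduler = new Scheduler();
        // In the scenario described above, the IPC handlers are already
        // running at this point, so heartbeat() may be invoked from now on.
    }

    @Override
    public int trackerCount() {
        return 1;
    }

    // Stand-in for the heartbeat RPC.
    int heartbeat() {
        return scheduler.assignTasks();
    }

    // Stand-in for the start step: the setter runs only after the
    // constructor has returned.
    static Tracker start() {
        Tracker tracker = new Tracker();
        tracker.scheduler.setManager(tracker);
        return tracker;
    }
}
```

Calling heartbeat() on an instance produced by `new Tracker()` alone throws a NullPointerException, while an instance returned by `Tracker.start()` answers normally.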

Possible solution #1:
{code}
    taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
+   taskScheduler.setTaskTrackerManager(this);
{code}
(and remove that line from startTracker())
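
As a self-contained illustration of that ordering, here is a minimal sketch. TrackerView, EagerScheduler and EagerTracker are hypothetical names, not Hadoop classes; the only point is that the back-reference is wired inside the constructor, before any handler could call in:

```java
// Hypothetical stand-ins; none of these names come from Hadoop.
interface TrackerView {
    int trackerCount();
}

class EagerScheduler {
    private TrackerView view;

    void setView(TrackerView view) {
        this.view = view;
    }

    int assignTasks() {
        return view.trackerCount();
    }
}

class EagerTracker implements TrackerView {
    final EagerScheduler scheduler;

    EagerTracker() {
        scheduler = new EagerScheduler();
        // Wire the back-reference immediately after creating the scheduler.
        scheduler.setView(this);
        // Any request handlers would be started only after this point.
    }

    @Override
    public int trackerCount() {
        return 1;
    }

    // Stand-in for the heartbeat RPC.
    int heartbeat() {
        return scheduler.assignTasks();
    }
}
```

One caveat with this shape: handing out `this` from a constructor means the scheduler holds a reference to a partially constructed object, so this is safe only if the scheduler does not call back into the tracker until construction has finished.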

Possible solution #2: delay startup of the RPC servers until after the JT 
object is fully initialized and in the RUNNING state, or at least has all of 
its members initialized.
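
A rough sketch of what the second option could look like, using a fake in-process server instead of real Hadoop IPC. FakeRpcServer, TwoPhaseTracker and startServing() are hypothetical names introduced for illustration only:

```java
import java.util.function.Supplier;

// Hypothetical stand-in for an RPC server: it refuses calls until start() runs.
class FakeRpcServer {
    private final Supplier<String> handler;
    private volatile boolean started = false;

    FakeRpcServer(Supplier<String> handler) {
        this.handler = handler;
    }

    void start() {
        started = true;
    }

    String call() {
        if (!started) {
            throw new IllegalStateException("server not started");
        }
        return handler.get();
    }
}

// Hypothetical tracker that separates construction from serving.
class TwoPhaseTracker {
    private String schedulerState; // null until initialization finishes
    final FakeRpcServer server;

    TwoPhaseTracker() {
        // Create the server, but do not start it inside the constructor.
        server = new FakeRpcServer(this::heartbeat);
    }

    // Finish initializing every member first, then open the RPC endpoint.
    void startServing() {
        schedulerState = "ready";
        server.start();
    }

    private String heartbeat() {
        // Would throw NullPointerException if it ran before startServing().
        return "scheduler is " + schedulerState.toUpperCase();
    }
}
```

With this split, a caller that arrives early is refused by the server rather than reaching a handler that depends on uninitialized members.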

I like #1 a lot better: it seems odd that this setter is called so late in 
startup.

> JobTracker accepts heartbeats before startup is complete
> --------------------------------------------------------
>
>                 Key: HADOOP-5852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5852
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> When the JobTracker is instantiated, it starts listening on its RPC 
> interfaces before its startup is complete (i.e., before the constructor has 
> finished executing). Because of this, jt.taskScheduler.taskTrackerManager 
> can be null when the JT receives a heartbeat from a TT. This throws the 
> JT/TT pair into a tight infinite loop (HADOOP-5761).

