[ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709986#action_12709986 ]
Todd Lipcon commented on HADOOP-5852:
-------------------------------------
From the JT log:
{noformat}
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2009-05-15 14:27:30,521 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://localhost/var/lib/hadoop/cache/hadoop/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /var/lib/hadoop/cache/hadoop/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
...
2009-05-15 14:27:32,202 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/todd-laptop
2009-05-15 14:27:32,204 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8021, call heartbeat(org.apache.hadoop.mapred.tasktrackersta...@7461373f, true, true, -1) from 127.0.0.1:36984: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.mapred.JobQueueTaskScheduler.assignTasks(JobQueueTaskScheduler.java:85)
    at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1285)
[infinite loop of above NPEs]
{noformat}
From the TT log:
{noformat}
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: localhost/127.0.0.1:37148
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,195 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,208 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
2009-05-15 14:27:32,220 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
etc etc
{noformat}
These logs are from 0.18.3, but the code seems to indicate this is still an
issue in trunk. This happens fairly reliably when I start up all of my Hadoop
daemons at exactly the same time -- the TT just needs to send its first heartbeat
to the JT while the NN is still in safe mode.
The problem lies in the fact that the TaskScheduler's TaskTrackerManager isn't
set until after the JobTracker constructor returns. The IPC handlers, however,
are started in the middle of the constructor. Therefore, heartbeats can be
received when the TaskTrackerManager is null, resulting in the NPE.
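The ordering hazard can be reproduced in a few lines. This is only a minimal sketch, not Hadoop code: `Tracker`, `Scheduler`, and `Manager` are hypothetical stand-ins for JobTracker, TaskScheduler, and TaskTrackerManager, and the RPC handler is modelled as a direct method call so the outcome is deterministic rather than timing-dependent.

```java
public class EarlyHeartbeatDemo {
    /** Hypothetical stand-in for TaskTrackerManager. */
    interface Manager {
        int clusterSize();
    }

    /** Hypothetical stand-in for TaskScheduler: the manager arrives via a setter. */
    static class Scheduler {
        private Manager manager; // stays null until setManager() is called

        void setManager(Manager m) {
            this.manager = m;
        }

        int assignTasks() {
            return manager.clusterSize(); // throws NullPointerException if unset
        }
    }

    /** Hypothetical stand-in for JobTracker. */
    static class Tracker implements Manager {
        final Scheduler scheduler = new Scheduler();

        public int clusterSize() {
            return 1;
        }

        /** Models the heartbeat call and reports a failure instead of propagating it. */
        String heartbeat() {
            try {
                return "assigned " + scheduler.assignTasks();
            } catch (NullPointerException e) {
                return "error: " + e.getClass().getName();
            }
        }
    }

    /** A heartbeat that arrives after construction but before the setter has run. */
    static String heartbeatBeforeSetter() {
        Tracker t = new Tracker();
        return t.heartbeat();
    }

    /** The same heartbeat once the setter has run. */
    static String heartbeatAfterSetter() {
        Tracker t = new Tracker();
        t.scheduler.setManager(t);
        return t.heartbeat();
    }

    public static void main(String[] args) {
        System.out.println(heartbeatBeforeSetter()); // error: java.lang.NullPointerException
        System.out.println(heartbeatAfterSetter());  // assigned 1
    }
}
```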
Possible solution #1:
{code}
  taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
+ taskScheduler.setTaskTrackerManager(this);
{code}
(and remove that line from startTracker())
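As a rough illustration of the ordering that #1 aims for, here is a self-contained sketch. The class names are hypothetical stand-ins rather than the real Hadoop types, and "starting the RPC server" is reduced to setting a flag; the only point is that the scheduler is wired before anything can deliver a heartbeat.

```java
public class EagerWiringDemo {
    /** Hypothetical stand-in for TaskTrackerManager. */
    interface Manager {
        int clusterSize();
    }

    /** Hypothetical stand-in for TaskScheduler. */
    static class Scheduler {
        private Manager manager;

        void setManager(Manager m) {
            this.manager = m;
        }

        int assignTasks() {
            return manager.clusterSize();
        }
    }

    /** Hypothetical stand-in for JobTracker. */
    static class Tracker implements Manager {
        private final Scheduler scheduler;
        private boolean serverStarted;

        Tracker() {
            scheduler = new Scheduler();
            scheduler.setManager(this); // wire the scheduler first ...
            serverStarted = true;       // ... and only then "start the RPC server"
        }

        public int clusterSize() {
            return 1;
        }

        int heartbeat() {
            if (!serverStarted) {
                throw new IllegalStateException("server not started");
            }
            return scheduler.assignTasks();
        }
    }

    /** The very first heartbeat already finds a wired scheduler. */
    static int firstHeartbeat() {
        return new Tracker().heartbeat();
    }

    public static void main(String[] args) {
        System.out.println(firstHeartbeat()); // 1
    }
}
```

One caveat with this shape: `this` escapes the constructor, so the scheduler should not call back into the tracker until the remaining fields are initialized.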
Possible solution #2: delay startup of RPC servers until after the JT object is
fully initialized and in the RUNNING state, or at least has all of its members
initialized.
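A sketch of what #2 could look like, again with hypothetical names: the constructor only initializes members, and a separate `start()` step (an assumption here, not an existing JobTracker method) opens the tracker to heartbeats. A refused call stands in for "nothing is listening yet".

```java
public class DeferredStartDemo {
    /** Hypothetical stand-in for JobTracker with a separate start step. */
    static class Tracker {
        private final String scheduler;   // stands in for a fully wired scheduler
        private volatile boolean running; // hypothetical RUNNING flag

        Tracker() {
            this.scheduler = "wired"; // the constructor only initializes members
        }

        /** Stands in for starting the RPC servers; called after construction. */
        void start() {
            running = true;
        }

        /** Models a heartbeat; before start() nothing is listening, so it is refused. */
        String heartbeat() {
            if (!running) {
                return "refused";
            }
            return "ok:" + scheduler;
        }
    }

    static String beforeStart() {
        return new Tracker().heartbeat();
    }

    static String afterStart() {
        Tracker t = new Tracker();
        t.start();
        return t.heartbeat();
    }

    public static void main(String[] args) {
        System.out.println(beforeStart()); // refused
        System.out.println(afterStart());  // ok:wired
    }
}
```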
I like #1 a lot better: it seems odd that this setter is called so late in
startup.
> JobTracker accepts heartbeats before startup is complete
> --------------------------------------------------------
>
> Key: HADOOP-5852
> URL: https://issues.apache.org/jira/browse/HADOOP-5852
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Todd Lipcon
> Priority: Critical
>
> When the JobTracker is instantiated, it starts listening on its RPC
> interfaces before its startup is complete (i.e., before the constructor has
> finished executing). Because of this, jt.taskScheduler.taskTrackerManager can
> be null when the JT receives a heartbeat from a TT. This throws the JT/TT pair
> into a tight infinite loop (HADOOP-5761).