[
https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403153#comment-13403153
]
Jessica J edited comment on MESOS-206 at 6/28/12 3:19 PM:
----------------------------------------------------------
Yeah, there are a number of these errors for multiple tasks that receive
resources, start, and finally arrive at the TASK_FINISHED state. The JobTracker
shows the error I pasted above for each "unknown" task; the master log says:
I0628 09:48:01.400383 25789 master.cpp:956] Status update from slave(1)@[slave-ip]:59707: task [task #] of framework 201206280753-36284608-5050-25784-0001 is now in state TASK_FINISHED
W0628 09:48:01.400524 25789 master.cpp:988] Status update from slave(1)@[slave-ip]:59707 ([slave hostname]): error, couldn't lookup task [task #]
These status updates come from multiple slave nodes, as well, so it's not just
a single node failing.
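To get a sense of how widespread this is, I put together a quick scan of the master log (a rough sketch only; the class name and regexes are my own guesses based on the log lines above, so adjust them for your format):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Quick-and-dirty scan of a Mesos master log: per slave, count TASK_FINISHED
// status updates and "couldn't lookup task" warnings. The regexes below are
// guessed from the excerpts above, not taken from Mesos itself.
public class MasterLogScan {
    public static void main(String[] args) throws Exception {
        Pattern finished = Pattern.compile(
            "Status update from (slave\\(\\d+\\)@\\S+?): task .*TASK_FINISHED");
        Pattern failed = Pattern.compile(
            "Status update from (slave\\(\\d+\\)@\\S+) .*couldn't lookup task");
        Map<String, int[]> bySlave = new HashMap<String, int[]>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher f = finished.matcher(line);
            Matcher w = failed.matcher(line);
            String slave;
            int column; // 0 = TASK_FINISHED, 1 = failed lookup
            if (f.find()) { slave = f.group(1); column = 0; }
            else if (w.find()) { slave = w.group(1); column = 1; }
            else continue;
            int[] counts = bySlave.get(slave);
            if (counts == null) { counts = new int[2]; bySlave.put(slave, counts); }
            counts[column]++;
        }
        in.close();
        for (Map.Entry<String, int[]> e : bySlave.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0]
                + " TASK_FINISHED, " + e.getValue()[1] + " failed lookups");
        }
    }
}

(That's how I confirmed the lookup failures aren't confined to one slave.)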
The first exception I see in the JobTracker's logs is a FileNotFoundException:
12/06/28 08:17:30 INFO mapred.TaskInProgress: Error from attempt_201206280805_0002_r_000014_1: Error initializing attempt_201206280805_0002_r_000014_1:
java.io.FileNotFoundException: File does not exist: hdfs://namenode:54310/scratch/hadoop/mapred/staging/jessicaj/.staging/job_201206280805_0002/job.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1164)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1145)
    at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:272)
    at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:372)
    at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:362)
    at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1201)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2372)
    at java.lang.Thread.run(Thread.java:679)
However, the JobTracker starts scheduling tasks "with 0 map slots and 0 reduce
slots" (my first indication that something is wrong) a full 5 minutes before
this exception occurs, so I'm not sure how the two events correlate.
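My working guess is that something cleans up the staging directory while tasks are still initializing. To pin down the timing, I put together a tiny watcher (a sketch only; the class name is mine, and the URI is copied from the trace above, so substitute the job ID of the run being watched) that polls HDFS and reports when job.jar disappears:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: poll HDFS once per second and report when the staging job.jar
// vanishes. The URI is copied from the FileNotFoundException above;
// substitute the job ID of the run you are watching.
public class StagingWatch {
    public static void main(String[] args) throws Exception {
        Path jar = new Path("hdfs://namenode:54310/scratch/hadoop/mapred/"
            + "staging/jessicaj/.staging/job_201206280805_0002/job.jar");
        FileSystem fs = FileSystem.get(jar.toUri(), new Configuration());
        while (fs.exists(jar)) {
            Thread.sleep(1000);
        }
        System.out.println(System.currentTimeMillis()
            + ": job.jar is no longer in staging");
    }
}

If the jar vanishes right around the "Deactivating framework" message in the master log, that would tie the two together.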
The only other exception I can find is in a DataNode's log:
2012-06-28 08:38:05,839 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration([ip-address]:50010, storageID=DS-739474830-[ip-address]-50010-1335530999790, infoPort=50075, ipcPort=50020):Got exception while serving blk_4193084752334304973_9283 to /[another datanode ip-address]:
java.io.IOException: Block blk_4193084752334304973_9283 is not valid.
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1029)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:992)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1002)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
    at java.lang.Thread.run(Thread.java:679)
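I haven't tied that block to a specific file yet. As a crude check (again just a sketch; ReadCheck is my own name, and the path argument is whatever file you suspect), streaming a file end-to-end through the FileSystem API should surface the same IOException if every replica of one of its blocks is bad:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read an HDFS file end-to-end. A missing/invalid block with no
// healthy replica should surface here as an IOException like the one in
// the DataNode log. Pass the suspect path as the first argument.
public class ReadCheck {
    public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]); // e.g. hdfs://namenode:54310/some/file
        FileSystem fs = FileSystem.get(p.toUri(), new Configuration());
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            total += n;
        }
        in.close();
        System.out.println("read " + total + " bytes OK from " + p);
    }
}

(hadoop fsck on the parent path should flag missing or corrupt blocks as well.)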
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
> Key: MESOS-206
> URL: https://issues.apache.org/jira/browse/MESOS-206
> Project: Mesos
> Issue Type: Bug
> Components: framework
> Reporter: Jessica J
> Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running
> jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter,
> completes normally, and the Hadoop framework continues for a while, but
> eventually, although it appears to still be running, it stops making progress
> on the jobs. The JobTracker keeps running, but each line of output indicates
> no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740 6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by
> messages indicating the slaves "couldn't lookup task [#]" and "couldn't
> lookup framework 201206080825-36284608-5050-6311-0000."
> At first I thought this error was a fluke, since it does not happen with
> shorter-running jobs or with the Hadoop framework running independently
> (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without
> the MPI framework running simultaneously.