[
https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403173#comment-13403173
]
Jessica J edited comment on MESOS-206 at 6/28/12 6:36 PM:
----------------------------------------------------------
It may be clearer if I provide a timeline:
8:05 The master node registers the Hadoop framework and jobs begin, running
normally.
8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce
slots." No prior exceptions can be found in any logs. (Perhaps these are normal
job-cleanup tasks?)
8:17 The JobTracker generates a FileNotFoundException.
8:37 A DataNode generates 4 IOExceptions for the same block.
9:47 The first status update for an "unknown" task shows up in the mesos-master
log; there are 114 of these, which are replicated in the JobTracker log at
9:48:17.
9:48:17 mesos-master log says, "Deactivating framework
201206280753-36284608-5050-25784-0001 as requested by scheduler(1)"
9:48:19 The jobs make a little more progress. (The JobTracker indicates that
tasks are completing successfully and are being scheduled with map/reduce tasks.)
9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."
9:57 I check the Hadoop web UI and notice that the numbers of map tasks and
reduce tasks have both dropped to 0. Since no further progress is being made, I
kill the framework.
From 8:10 to 9:48, the mesos-slave logs contain multiple repetitions of this
warning: W0628 08:11:47.255110 23714 slave.cpp:1027] Status update error:
couldn't lookup executor for framework 201206280753-36284608-5050-25784-0001
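For anyone trying to reproduce this, a quick way to tally those slave warnings per framework is a grep pipeline like the one below. This is only a sketch: the log path is a hypothetical placeholder, not the path from my cluster, so point it at wherever your mesos-slave logs actually live.

```shell
# Hypothetical log location -- adjust to your deployment.
LOG_GLOB="/var/log/mesos/mesos-slave.INFO*"

# Count "Status update error" warnings, grouped by framework ID.
grep -h "Status update error: couldn't lookup executor" $LOG_GLOB 2>/dev/null \
  | grep -oE "framework [0-9A-Za-z-]+" \
  | sort | uniq -c | sort -rn
```

If the counts climb steadily over the run, that lines up with the timeline above, where the warnings repeat from 8:10 until the framework is deactivated at 9:48.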
was (Author: esohpromatem):
It may be clearer if I provide a timeline:
8:05 The master node registers the Hadoop framework and jobs begin, running
normally.
8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce
slots." No prior exceptions can be found in any logs. (Perhaps these are normal
job-cleanup tasks?)
8:17 The JobTracker generates a FileNotFoundException.
8:37 A DataNode generates 4 IOExceptions for the same block.
9:47 The first status update for an "unknown" task shows up in the mesos-master
log. The JobTracker indicates a large number (20-30?) of "unknown task" status
updates for a full minute.
9:48:17 mesos-master log says, "Deactivating framework
201206280753-36284608-5050-25784-0001 as requested by scheduler(1)"
9:48:19 The jobs make a little more progress. (The JobTracker indicates that
tasks are completing successfully and are being scheduled with map/reduce tasks.)
9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."
9:57 I check the Hadoop web UI and notice that the numbers of map tasks and
reduce tasks have both dropped to 0. Since no further progress is being made, I
kill the framework.
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
> Key: MESOS-206
> URL: https://issues.apache.org/jira/browse/MESOS-206
> Project: Mesos
> Issue Type: Bug
> Components: framework
> Reporter: Jessica J
> Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running
> jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter,
> completes normally, and the Hadoop framework continues for a while, but
> eventually, although it appears to still be running, it stops making progress
> on the jobs. The JobTracker keeps running, but each line of output indicates
> that no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for
> [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740 6317 master.cpp:681] Deactivating framework
> 201206080825-36284608-5050-6311-0000 as requested by
> scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by
> messages indicating the slaves "couldn't lookup task [#]" and "couldn't
> lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen
> with shorter running jobs or with the Hadoop framework running independently
> (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without
> the MPI framework running simultaneously.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira