So I'm still a little confused here. From what you showed, it looks like
the 'job -kill' command you posted succeeded on the client side?

[trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration
already set up for Hadoop, not re-installing.
Killed job job_201304121621_0001

I see that the JobTracker still thinks the job is pending. What happens
when you re-issue that kill command?

I'm confused as to why the job is still pending when the JobTracker has
logged that it is killing it:

13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001
13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'

Looking at the JobTracker code, this path does seem to kill the job.

Here is the code that prints that log line, in JobTracker.killJob():
  private synchronized void killJob(JobInProgress job) {
    LOG.info("Killing job " + job.getJobID());
    JobStatus prevStatus = (JobStatus)job.getStatus().clone();
    job.kill();

    // Inform the listeners if the job is killed
    // Note :
    //   If the job is killed in the PREP state then the listeners will be
    //   invoked
    //   If the job is killed in the RUNNING state then cleanup tasks will be
    //   launched and the updateTaskStatuses() will take care of it
    JobStatus newStatus = (JobStatus)job.getStatus().clone();
    if (prevStatus.getRunState() != newStatus.getRunState()
        && newStatus.getRunState() == JobStatus.KILLED) {
      JobStatusChangeEvent event =
        new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus,
            newStatus);
      updateJobInProgressListeners(event);
    }
  }
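
If the job is still in the PREP state when the kill lands (which seems
likely here, since no TaskTrackers were ever launched), this is the path
where the listeners, including the MesosScheduler, should be told about the
KILLED transition. If it helps narrow things down, one cheap piece of
instrumentation would be to log the run states killJob() actually observes
just before that comparison. This is purely a hypothetical sketch on my
side, not something in the tree:

    // Hypothetical debug line, added in JobTracker.killJob() right after the
    // second clone and before the run-state comparison shown above.
    LOG.info("killJob debug: " + job.getJobID()
        + " prev run state=" + prevStatus.getRunState()
        + ", new run state=" + newStatus.getRunState()
        + " (KILLED=" + JobStatus.KILLED + ")");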

Then JobInProgress.kill():
  /**
   * Kill the job and all its component tasks. This method should be called
   * from jobtracker and should return fast as it locks the jobtracker.
   */
  public void kill() {
    boolean killNow = false;
    synchronized(jobInitKillStatus) {
      jobInitKillStatus.killed = true;
      //if not in middle of init, terminate it now
      if(!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
        //avoiding nested locking by setting flag
        killNow = true;
      }
    }
    if(killNow) {
      terminate(JobStatus.KILLED);
    }
  }
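
One thing that might be worth ruling out on your end: per the code above, if
the kill arrives while the job is still initializing (initStarted set but
initDone not yet), killNow stays false and terminate() is deferred rather
than called here. To confirm which path you actually hit, here is a minimal
logging sketch for kill(). This is hypothetical instrumentation against the
CDH3u3 source as quoted above, untested:

  public void kill() {
    boolean killNow = false;
    synchronized(jobInitKillStatus) {
      jobInitKillStatus.killed = true;
      // Hypothetical debug line: record the init flags seen at kill time.
      LOG.info("kill() debug for " + getJobID()
          + ": initStarted=" + jobInitKillStatus.initStarted
          + ", initDone=" + jobInitKillStatus.initDone);
      //if not in middle of init, terminate it now
      if(!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
        //avoiding nested locking by setting flag
        killNow = true;
      }
    }
    // Hypothetical debug line: shows whether terminate() runs immediately.
    LOG.info("kill() debug for " + getJobID() + ": killNow=" + killNow);
    if(killNow) {
      terminate(JobStatus.KILLED);
    }
  }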

Beyond that, I don't really see the issue at this point, so any further
information or debugging on your end that helps reveal this bug would be
very valuable to us.

Lastly, what version of the code are you running? If you're running off
trunk, when did you last update it?

Ben


On Fri, Apr 12, 2013 at 1:32 AM, 王国栋 <[email protected]> wrote:

> Hi Vinod,
>
> When I submit the job, the jobtracker log is as follows
> ----------------------------------
> 13/04/12 16:22:37 INFO mapred.MesosScheduler: Added job
> job_201304121621_0001
> 13/04/12 16:22:37 INFO mapred.JobTracker: Job job_201304121621_0001 added
> successfully for user 'guodong' to queue 'default'
> 13/04/12 16:22:37 INFO mapred.JobTracker: Initializing
> job_201304121621_0001
> 13/04/12 16:22:37 INFO mapred.JobInProgress: Initializing
> job_201304121621_0001
> 13/04/12 16:22:37 INFO mapred.AuditLogger: USER=guodong IP=127.0.0.1
> OPERATION=SUBMIT_JOB TARGET=job_201304121621_0001 RESULT=SUCCESS
> 13/04/12 16:22:37 INFO mapred.JobInProgress: jobToken generated and stored
> with users keys in
> /tmp/hadoop-guodong/mapred/system/job_201304121621_0001/jobToken
> 13/04/12 16:22:37 INFO mapred.JobInProgress: Input size for job
> job_201304121621_0001 = 89502988. Number of splits = 3
> 13/04/12 16:22:37 INFO net.NetworkTopology: Adding a new node:
> /default-rack/localhost
> 13/04/12 16:22:37 INFO mapred.JobInProgress:
> tip:task_201304121621_0001_m_000000 has split on
> node:/default-rack/localhost
> 13/04/12 16:22:37 INFO mapred.JobInProgress:
> tip:task_201304121621_0001_m_000001 has split on
> node:/default-rack/localhost
> 13/04/12 16:22:37 INFO mapred.JobInProgress:
> tip:task_201304121621_0001_m_000002 has split on
> node:/default-rack/localhost
> 13/04/12 16:22:37 INFO mapred.JobInProgress: job_201304121621_0001
> LOCALITY_WAIT_FACTOR=1.0
> 13/04/12 16:22:37 INFO mapred.JobInProgress: Job job_201304121621_0001
> initialized successfully with 3 map tasks and 1 reduce tasks.
> 13/04/12 16:22:39 INFO mapred.MesosScheduler: JobTracker Status
>       Pending Map Tasks: 3
>    Pending Reduce Tasks: 1
>          Idle Map Slots: 0
>       Idle Reduce Slots: 0
>      Inactive Map Slots: 0 (launched but no hearbeat yet)
>   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
>        Needed Map Slots: 3
>     Needed Reduce Slots: 1
> 13/04/12 16:22:39 INFO mapred.MesosScheduler: Declining offer with
> insufficient resources for a TaskTracker:
>   cpus: offered 4.0 needed 1.800000011920929
>   mem : offered 2731.0 needed 6400.0
>   disk: offered 75120.0 needed 4096.0
>   ports:  at least 2 (sufficient)
> [name: "cpus"
> type: SCALAR
> scalar {
>   value: 4.0
> }
> , name: "mem"
> type: SCALAR
> scalar {
>   value: 2731.0
> }
> , name: "ports"
> type: RANGES
> ranges {
>   range {
>     begin: 31000
>     end: 32000
>   }
> }
> , name: "disk"
> type: SCALAR
> scalar {
>   value: 75120.0
> }
> ]
> 13/04/12 16:22:39 INFO mapred.MesosScheduler: Unable to fully satisfy
> needed map/reduce slots: 3 map slots 1 reduce slots remaining
> 13/04/12 16:22:45 INFO mapred.MesosScheduler: JobTracker Status
>       Pending Map Tasks: 3
>    Pending Reduce Tasks: 1
>          Idle Map Slots: 0
>       Idle Reduce Slots: 0
>      Inactive Map Slots: 0 (launched but no hearbeat yet)
>   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
>        Needed Map Slots: 3
>     Needed Reduce Slots: 1
> 13/04/12 16:22:45 INFO mapred.MesosScheduler: Declining offer with
> insufficient resources for a TaskTracker:
>   cpus: offered 4.0 needed 1.800000011920929
>   mem : offered 2731.0 needed 6400.0
>   disk: offered 75120.0 needed 4096.0
>   ports:  at least 2 (sufficient)
> [name: "cpus"
> type: SCALAR
> scalar {
>   value: 4.0
> }
> , name: "mem"
> type: SCALAR
> scalar {
>   value: 2731.0
> }
> , name: "ports"
> type: RANGES
> ranges {
>   range {
>     begin: 31000
>     end: 32000
>   }
> }
> , name: "disk"
> type: SCALAR
> scalar {
>   value: 75120.0
> }
> ]
> 13/04/12 16:22:45 INFO mapred.MesosScheduler: Unable to fully satisfy
> needed map/reduce slots: 3 map slots 1 reduce slots remaining
>
> ----------------------------------
>
> the hadoop client log is
> --------------------------------------------
> 13/04/12 16:22:36 INFO security.UserGroupInformation: JAAS Configuration
> already set up for Hadoop, not re-installing.
> 13/04/12 16:22:36 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 13/04/12 16:22:37 INFO input.FileInputFormat: Total input paths to process
> : 1
> 13/04/12 16:22:37 WARN snappy.LoadSnappy: Snappy native library is
> available
> 13/04/12 16:22:37 INFO snappy.LoadSnappy: Snappy native library loaded
> 13/04/12 16:22:37 INFO mapred.JobClient: Running job: job_201304121621_0001
> 13/04/12 16:22:38 INFO mapred.JobClient:  map 0% reduce 0%
>
> --------------------------------------------
>
> Since the client hangs, I use ctrl-c to stop it, then use job -status to
> check the job status.
>
> guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> 13/04/12 16:26:22 INFO security.UserGroupInformation: JAAS Configuration
> already set up for Hadoop, not re-installing.
>
> Job: job_201304121621_0001
> file:
>
> file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> tracking URL:
> http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> map() completion: 0.0
> reduce() completion: 0.0
> Counters: 0
>
> Then I try to kill the job:
> guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
> 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration
> already set up for Hadoop, not re-installing.
> Killed job job_201304121621_0001
>
> When I kill the job, I can see this in the jobtracker log:
> 13/04/12 16:27:13 INFO mapred.MesosScheduler: Unable to fully satisfy
> needed map/reduce slots: 3 map slots 1 reduce slots remaining
> 13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001
> 13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'
>
> After I kill the job, I use job -status to check the job status again, and
> it is still pending.
>
> guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> 13/04/12 16:31:09 INFO security.UserGroupInformation: JAAS Configuration
> already set up for Hadoop, not re-installing.
>
> Job: job_201304121621_0001
> file:
>
> file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> tracking URL:
> http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> map() completion: 0.0
> reduce() completion: 0.0
> Counters: 0
>
>
> I hope this is helpful!
>
> Looking forward to your progress. I will also try to go through the hadoop
> code to check it again.
>
> Thanks.
>
>
>
>
> Guodong
>
>
> On Fri, Apr 12, 2013 at 3:42 PM, Vinod Kone <[email protected]> wrote:
>
> > Hi Guodong,
> >
> > This is likely a bug on our side.
> >
> > Could you paste the output of the client when you issue the kill? The
> > output of the jobtracker (during the submission and kill of the job)
> > would also be helpful.
> >
> > Thanks,
> >
> >
> >
> > -- Vinod
> >
> >
> > On Thu, Apr 11, 2013 at 10:43 PM, 王国栋 <[email protected]> wrote:
> >
> > > Hi Ben,
> > >
> > > I am sorry for my mistake about point 2. The jobtracker jetty server
> > > works fine. Yesterday, the execution time for my test job was too
> > > short, so it finished before the jetty server could show the job status
> > > in the running list. Today, I tried a bigger job, and the status is
> > > perfectly right.
> > >
> > > More information about point 1. I can reproduce this with the following
> > > steps:
> > > 1. Set the resources needed for each slot (memory, cpu) in
> > > mapred-site.xml. Make sure each slot needs a lot of resources.
> > > 2. Provide only one mesos slave whose resources are not enough for one
> > > mapper slot and one reduce slot.
> > > 3. Run the jobtracker and submit a job which needs 1 mapper and 1
> > > reducer.
> > >
> > > Then I find that the job is pending due to insufficient resources. I
> > > can use ctrl-c to stop the hadoop client, but the job is still pending.
> > >
> > > I try to kill the job with "hadoop job -kill ", but I cannot kill the
> > > pending job.
> > > Before I kill the job, the job status is
> > > Job Setup: Pending
> > > (http://localhost:50030/jobtasks.jsp?jobid=job_201304121059_0001&type=setup&pagenum=1&state=killed)
> > > Job Cleanup: Pending
> > >
> > > After I try to kill the job, the job status is
> > > Job Setup: failed
> > > Job Cleanup: Pending
> > >
> > > Then the job is hung up, and I can never stop it.
> > >
> > > Is it possible that the mesos scheduler misses some killed-job event?
> > >
> > >
> > >
> > >
> > > Guodong
> > >
> > >
> > > On Fri, Apr 12, 2013 at 2:59 AM, Benjamin Mahler
> > > <[email protected]>wrote:
> > >
> > > > 'Pending Map Tasks': The number of pending map tasks in the Hadoop
> > > > JobTracker.
> > > > 'Pending Reduce Tasks': The number of pending reduce tasks in the
> > > > Hadoop JobTracker.
> > > >
> > > > Did you successfully kill the job? If so, did you allow some time for
> > > > the JobTracker to detect that the job was killed?
> > > >
> > > > Our scheduler (MesosScheduler.java) simply introspects on the
> > > > JobTracker state to determine the number of pending map/reduce tasks
> > > > in the system. I would expect this to go to 0 sometime after you kill
> > > > your job; if that's not the case, I'll need more information to
> > > > figure out what's going on.
> > > >
> > > > Can you elaborate more on your point 2? It almost sounds like you're
> > > > talking to a different JobTracker? Note that the MesosScheduler runs
> > > > *inside* the JobTracker.
> > > >
> > > > Also, note that we recently committed a deadlock fix for the Hadoop
> > > patch:
> > > > https://reviews.apache.org/r/10352/
> > > >
> > > >
> > > > On Thu, Apr 11, 2013 at 2:31 AM, 王国栋 <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am trying to run hadoop over mesos, using the code in the trunk,
> > > > > but I ran into some problems here. My hadoop version is cdh3u3.
> > > > >
> > > > > 1. When a job is pending because there are not enough resources, I
> > > > > use ctrl-c to stop the job client, but I can see the pending mapper
> > > > > and pending reducer are still in the job tracker. Then I try to use
> > > > > "hadoop job -kill jobid" to kill this job, but nothing happens in
> > > > > the jobtracker; the mapper and reducer are still pending. The
> > > > > jobtracker log is as follows.
> > > > >
> > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: JobTracker Status
> > > > >       Pending Map Tasks: 1
> > > > >    Pending Reduce Tasks: 1
> > > > >          Idle Map Slots: 0
> > > > >       Idle Reduce Slots: 0
> > > > >      Inactive Map Slots: 0 (launched but no hearbeat yet)
> > > > >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> > > > >        Needed Map Slots: 1
> > > > >     Needed Reduce Slots: 1
> > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Declining offer with
> > > > > insufficient resources for a TaskTracker:
> > > > >   cpus: offered 4.0 needed 1.800000011920929
> > > > >   mem : offered 2731.0 needed 6432.0
> > > > >   disk: offered 70651.0 needed 4096.0
> > > > >   ports:  at least 2 (sufficient)
> > > > > [name: "cpus"
> > > > > type: SCALAR
> > > > > scalar {
> > > > >   value: 4.0
> > > > > }
> > > > > , name: "mem"
> > > > > type: SCALAR
> > > > > scalar {
> > > > >   value: 2731.0
> > > > > }
> > > > > , name: "ports"
> > > > > type: RANGES
> > > > > ranges {
> > > > >   range {
> > > > >     begin: 31000
> > > > >     end: 32000
> > > > >   }
> > > > > }
> > > > > , name: "disk"
> > > > > type: SCALAR
> > > > > scalar {
> > > > >   value: 70651.0
> > > > > }
> > > > > ]
> > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Unable to fully
> satisfy
> > > > > needed map/reduce slots: 1 map slots 1 reduce slots remaining
> > > > >
> > > > > 2. When we submit the job to the jobtracker, I cannot find any
> > > > > running job on the jobtracker web interface
> > > > > (http://localhost:50030/jobtracker.jsp). But when the job is
> > > > > finished, I can see the job info under retired jobs in the
> > > > > jobtracker.
> > > > >
> > > > > Any ideas about this? Thanks a lot.
> > > > >
> > > > > Guodong
> > > > >
> > > >
> > >
> >
>
