Hi Ben, I have put my answers inline below. Please take a look.
On Mon, Apr 15, 2013 at 8:13 AM, Benjamin Mahler <[email protected]> wrote:

> So I'm still a little confused here. From what you showed, it looks like
> the 'job -kill' command you posted succeeded on the client side?
>
> [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
> 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> *Killed job job_201304121621_0001*
>

Yes, from the client side, killing the job is successful.

> I see that the JobTracker still thinks the job is pending. What happens
> when you re-issue that kill command?
>

The JobTracker still thinks the status of the job is *PREP*. When I re-issue the kill command, it behaves the same as the first time I issued it.

> I'm confused as to why it's still in pending when the JobTracker has
> removed the job:
>
> 13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001
> 13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'
>
> Looking at the JobTracker code, it seems like this indeed kills the job:
>
> The code that prints the log line in JobTracker:
>
>   private synchronized void killJob(JobInProgress job) {
>     LOG.info("Killing job " + job.getJobID());
>     JobStatus prevStatus = (JobStatus)job.getStatus().clone();
>     job.kill();
>
>     // Inform the listeners if the job is killed
>     // Note:
>     //   If the job is killed in the PREP state then the listeners will be invoked
>     //   If the job is killed in the RUNNING state then cleanup tasks will be
>     //   launched and the updateTaskStatuses() will take care of it
>     JobStatus newStatus = (JobStatus)job.getStatus().clone();
>     if (prevStatus.getRunState() != newStatus.getRunState()
>         && newStatus.getRunState() == JobStatus.KILLED) {
>       JobStatusChangeEvent event =
>         new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus, newStatus);
>       updateJobInProgressListeners(event);
>     }
>   }
>
> Then JobInProgress.kill():
>
>   /**
>    * Kill the job and all its component tasks. This method should be called from
>    * jobtracker and should return fast as it locks the jobtracker.
>    */
>   public void kill() {
>     boolean killNow = false;
>     synchronized (jobInitKillStatus) {
>       jobInitKillStatus.killed = true;
>       // if not in middle of init, terminate it now
>       if (!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
>         // avoiding nested locking by setting flag
>         killNow = true;
>       }
>     }
>     if (killNow) {
>       terminate(JobStatus.KILLED);
>     }
>   }
>
> I don't really see the issue at this point, so any further information /
> debugging on your end that reveals this bug would be very valuable to us.
>

I am trying to debug this on my laptop. What I see is that when I issue the kill command to the JobTracker, the job status never changes; it stays in *PREP*. Can you reproduce this on your machine by following the steps I mentioned in my previous mail?

> Lastly, what version of the code are you running? If you're running off
> trunk, when did you last update it?
>

I am running the code from trunk; it was last updated last week.
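To narrow this down on my side, I will add some temporary logging around the
init/kill flags in JobInProgress.kill(), to see whether the kill is being
deferred because the job is still considered "in init" and therefore never
reaches terminate(). This is only a debugging sketch based on the cdh3u3 code
you pasted above (I am reusing getJobID()/getStatus() from your snippet; the
exact fields may differ in other versions):

    // Temporary instrumentation in JobInProgress.kill(), for debugging only.
    public void kill() {
      boolean killNow = false;
      synchronized (jobInitKillStatus) {
        jobInitKillStatus.killed = true;
        // Log the init flags: if initStarted is true and initDone is false,
        // the kill is deferred instead of calling terminate() right here.
        LOG.info("kill(): job=" + getJobID()
            + " initStarted=" + jobInitKillStatus.initStarted
            + " initDone=" + jobInitKillStatus.initDone
            + " runState=" + getStatus().getRunState());
        // if not in middle of init, terminate it now
        if (!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
          // avoiding nested locking by setting flag
          killNow = true;
        }
      }
      LOG.info("kill(): killNow=" + killNow + " for job " + getJobID());
      if (killNow) {
        terminate(JobStatus.KILLED);
      }
    }

I will report back what these log lines show while the job is stuck in *PREP*.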
>
> Ben
>
> On Fri, Apr 12, 2013 at 1:32 AM, 王国栋 <[email protected]> wrote:
>
> > Hi Vinod,
> >
> > When I submit the job, the log of jobtracker is as follow
> > ----------------------------------
> > 13/04/12 16:22:37 INFO mapred.MesosScheduler: Added job job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.JobTracker: Job job_201304121621_0001 added successfully for user 'guodong' to queue 'default'
> > 13/04/12 16:22:37 INFO mapred.JobTracker: Initializing job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Initializing job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.AuditLogger: USER=guodong IP=127.0.0.1 OPERATION=SUBMIT_JOB TARGET=job_201304121621_0001 RESULT=SUCCESS
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: jobToken generated and stored with users keys in /tmp/hadoop-guodong/mapred/system/job_201304121621_0001/jobToken
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Input size for job job_201304121621_0001 = 89502988. Number of splits = 3
> > 13/04/12 16:22:37 INFO net.NetworkTopology: Adding a new node: /default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000000 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000001 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000002 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: job_201304121621_0001 LOCALITY_WAIT_FACTOR=1.0
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Job job_201304121621_0001 initialized successfully with 3 map tasks and 1 reduce tasks.
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: JobTracker Status
> >   Pending Map Tasks: 3
> >   Pending Reduce Tasks: 1
> >   Idle Map Slots: 0
> >   Idle Reduce Slots: 0
> >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> >   Needed Map Slots: 3
> >   Needed Reduce Slots: 1
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> >   cpus: offered 4.0 needed 1.800000011920929
> >   mem : offered 2731.0 needed 6400.0
> >   disk: offered 75120.0 needed 4096.0
> >   ports: at least 2 (sufficient)
> > [name: "cpus"
> > type: SCALAR
> > scalar {
> >   value: 4.0
> > }
> > , name: "mem"
> > type: SCALAR
> > scalar {
> >   value: 2731.0
> > }
> > , name: "ports"
> > type: RANGES
> > ranges {
> >   range {
> >     begin: 31000
> >     end: 32000
> >   }
> > }
> > , name: "disk"
> > type: SCALAR
> > scalar {
> >   value: 75120.0
> > }
> > ]
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: JobTracker Status
> >   Pending Map Tasks: 3
> >   Pending Reduce Tasks: 1
> >   Idle Map Slots: 0
> >   Idle Reduce Slots: 0
> >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> >   Needed Map Slots: 3
> >   Needed Reduce Slots: 1
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> >   cpus: offered 4.0 needed 1.800000011920929
> >   mem : offered 2731.0 needed 6400.0
> >   disk: offered 75120.0 needed 4096.0
> >   ports: at least 2 (sufficient)
> > [name: "cpus"
> > type: SCALAR
> > scalar {
> >   value: 4.0
> > }
> > , name: "mem"
> > type: SCALAR
> > scalar {
> >   value: 2731.0
> > }
> > , name: "ports"
> > type: RANGES
> > ranges {
> >   range {
> >     begin: 31000
> >     end: 32000
> >   }
> > }
> > , name: "disk"
> > type: SCALAR
> > scalar {
> >   value: 75120.0
> > }
> > ]
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
> >
> > ----------------------------------
> >
> > the hadoop client log is
> > --------------------------------------------
> > 13/04/12 16:22:36 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> > 13/04/12 16:22:36 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> > 13/04/12 16:22:37 INFO input.FileInputFormat: Total input paths to process : 1
> > 13/04/12 16:22:37 WARN snappy.LoadSnappy: Snappy native library is available
> > 13/04/12 16:22:37 INFO snappy.LoadSnappy: Snappy native library loaded
> > 13/04/12 16:22:37 INFO mapred.JobClient: Running job: job_201304121621_0001
> > 13/04/12 16:22:38 INFO mapred.JobClient: map 0% reduce 0%
> > --------------------------------------------
> >
> > Since the client is hung up, I use ctrl-c to stop the client. Then use job -status to check the job status.
> >
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> > 13/04/12 16:26:22 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >
> > Job: job_201304121621_0001
> > file: file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> > tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> > map() completion: 0.0
> > reduce() completion: 0.0
> > Counters: 0
> >
> > Then try to kill the job
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
> > 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> > Killed job job_201304121621_0001
> >
> > when I kill the job, I can see the log on jobtracker
> > *13/04/12 16:27:13 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining*
> > *13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001*
> > *13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'*
> >
> > After I kill the job, I can use job-status to check the job status, and it is still pending.
> >
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> > 13/04/12 16:31:09 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >
> > Job: job_201304121621_0001
> > file: file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> > tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> > map() completion: 0.0
> > reduce() completion: 0.0
> > Counters: 0
> >
> > I hope this is helpful !
> >
> > Looking forward to your new progress. And I will also try to go through hadoop code to check it again.
> >
> > Thanks.
> >
> > Guodong
> >
> > On Fri, Apr 12, 2013 at 3:42 PM, Vinod Kone <[email protected]> wrote:
> >
> > > Hi Guodong,
> > >
> > > This is likely a bug on our side.
> > >
> > > Could you paste the output of the client when you issue the kill? Also,
> > > the output of the jobtracker (during the submission and kill of the job)
> > > would also be helpful.
> > >
> > > Thanks,
> > >
> > > -- Vinod
> > >
> > > On Thu, Apr 11, 2013 at 10:43 PM, 王国栋 <[email protected]> wrote:
> > >
> > > > Hi Ben,
> > > >
> > > > I am sorry for my mistake about point 2. The jobtracker jetty server
> > > > works fine. Yesterday, the execution time for my test job is too short,
> > > > so it is finished before the jetty server can show the job status in
> > > > running list. Today, I try some big job, and the status is perfectly right.
> > > >
> > > > More information about point 1. I can reproduce this by the following step.
> > > > 1. set the needed resource for each slot(memory, cpu) in mapred-site.xml.
> > > >    Make sure each slot need a lot of resource.
> > > > 2. Given only one mesos slave whose resource is not enough for one mapper
> > > >    slot and one reduce slot.
> > > > 3. run the jobtracker, submit a job which need 1 mapper and 1 reducer.
> > > >
> > > > Then, I can find out that the job is pending due to not enough resource.
> > > > I can use ctrl-c to stop the hadoop client. But the job is still pending.
> > > >
> > > > I try to kill the job with "hadoop job -kill ", but I can not kill the
> > > > pending job.
> > > > Before kill the job, I can check the job status is
> > > > *Job Setup:* <http://localhost:50030/jobtasks.jsp?jobid=job_201304121059_0001&type=setup&pagenum=1&state=killed> Pending
> > > > *Job Cleanup:* Pending
> > > >
> > > > After I try to kill the job, the job status is
> > > > *Job Setup:* failed
> > > > *Job Cleanup:* Pending
> > > >
> > > > Then the job is hang up. And I can never stop it.
> > > >
> > > > Is it possible that mesos-scheduler miss some killed job event ?
> > > >
> > > > Guodong
> > > >
> > > > On Fri, Apr 12, 2013 at 2:59 AM, Benjamin Mahler <[email protected]> wrote:
> > > >
> > > > > 'Pending Map Tasks': The nubmer of pending map tasks in the Hadoop
> > > > > JobTracker.
> > > > > 'Pending Reduce Tasks': The nubmer of pending reduce tasks in the
> > > > > Hadoop JobTracker.
> > > > >
> > > > > Did you successfully kill the job? If so, did you allow some time for
> > > > > the JobTracker to detect that the job was killed.
> > > > >
> > > > > Our scheduler (MesosScheduler.java) simply introspects on the JobTracker
> > > > > state to determine the number of pending map/reduce tasks in the system.
> > > > > I would expect this to go to 0 sometime after you kill your job, if
> > > > > that's not the case, I'll need more information to figure out what's
> > > > > going on.
> > > > >
> > > > > Can you elaborate more on your point 2? It almost sounds like you're
> > > > > talking to a different JobTracker? Note that the MesosScheduler runs
> > > > > *inside* the JobTracker.
> > > > >
> > > > > Also, note that we recently committed a deadlock fix for the Hadoop
> > > > > patch: https://reviews.apache.org/r/10352/
> > > > >
> > > > > On Thu, Apr 11, 2013 at 2:31 AM, 王国栋 <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to run hadoop over mesos. And I am using the code in the
> > > > > > trunk. But I ran into some problems here. My hadoop version is cdh3u3.
> > > > > >
> > > > > > *1. when a job is pending because of no enough resources, I use ctrl-c
> > > > > > to stop the job client. But I can see the pending mapper and pending
> > > > > > reducer are still in the job tracker. Then I try to use "hadoop job
> > > > > > -kill jobid" to kill this job, but nothing happens in jobtracker,
> > > > > > mapper and reducer are still pending. The log in jobtracker is as
> > > > > > follow.*
> > > > > >
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: JobTracker Status
> > > > > >   Pending Map Tasks: 1
> > > > > >   Pending Reduce Tasks: 1
> > > > > >   Idle Map Slots: 0
> > > > > >   Idle Reduce Slots: 0
> > > > > >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> > > > > >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> > > > > >   Needed Map Slots: 1
> > > > > >   Needed Reduce Slots: 1
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> > > > > >   cpus: offered 4.0 needed 1.800000011920929
> > > > > >   mem : offered 2731.0 needed 6432.0
> > > > > >   disk: offered 70651.0 needed 4096.0
> > > > > >   ports: at least 2 (sufficient)
> > > > > > [name: "cpus"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 4.0
> > > > > > }
> > > > > > , name: "mem"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 2731.0
> > > > > > }
> > > > > > , name: "ports"
> > > > > > type: RANGES
> > > > > > ranges {
> > > > > >   range {
> > > > > >     begin: 31000
> > > > > >     end: 32000
> > > > > >   }
> > > > > > }
> > > > > > , name: "disk"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 70651.0
> > > > > > }
> > > > > > ]
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 1 map slots 1 reduce slots remaining
> > > > > >
> > > > > > *2. when we submit the job to the jobtracker, I can not find any running
> > > > > > job on jobtracker web interface (http://localhost:50030/jobtracker.jsp).
> > > > > > But when the job is finished, I can see the job info in retired job in
> > > > > > jobtracker.*
> > > > > >
> > > > > > Any ideas about this ? Thanks a lot.
> > > > > >
> > > > > > Guodong
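One more thing: while I reproduce the steps above (a single slave whose offer
is too small for even one map and one reduce slot), I am also watching the job
state from a small client program rather than the web UI, so I can see exactly
whether the state ever leaves PREP after the kill. This is just a rough sketch
using the old org.apache.hadoop.mapred client API; the class name and the
JobTracker address are assumptions for my local setup:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class WatchJobState {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Assumption: local single-node setup; adjust to your JobTracker address.
        conf.set("mapred.job.tracker", "localhost:9001");

        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName(args[0]));
        if (job == null) {
          System.err.println("Job not found: " + args[0]);
          return;
        }

        System.out.println("state before kill: " + job.getJobState());
        job.killJob();

        // Poll for a minute; PREP=4 and KILLED=5 in the old mapred JobStatus.
        for (int i = 0; i < 30; i++) {
          Thread.sleep(2000);
          System.out.println("state: " + job.getJobState()
              + " (PREP=" + JobStatus.PREP + ", KILLED=" + JobStatus.KILLED + ")");
        }
      }
    }

I will paste what this prints, together with the jobtracker log, the next time
I reproduce the issue.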
