Hi Ben, I have put my answers inline below. Please take a look.
On Mon, Apr 15, 2013 at 8:13 AM, Benjamin Mahler <[email protected]> wrote:

> So I'm still a little confused here. From what you showed, it looks like
> the 'job -kill' command you posted succeeded on the client side?
>
> [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
> 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> *Killed job job_201304121621_0001*
>

Yes, from the client side, killing the job is successful.

> I see that the JobTracker still thinks the job is pending. What happens
> when you re-issue that kill command?
>

The JobTracker still thinks the status of the job is *PREP*. When I re-issue the kill command, it behaves the same as the first time I issued it.

> I'm confused as to why it's still in pending when the JobTracker has
> removed the job:
>
> 13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001
> 13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'
>
> Looking at the JobTracker code, it seems like this indeed kills the job:
>
> The code that prints the log line in JobTracker:
>
>   private synchronized void killJob(JobInProgress job) {
>     LOG.info("Killing job " + job.getJobID());
>     JobStatus prevStatus = (JobStatus)job.getStatus().clone();
>     job.kill();
>
>     // Inform the listeners if the job is killed
>     // Note:
>     //   If the job is killed in the PREP state then the listeners will be invoked
>     //   If the job is killed in the RUNNING state then cleanup tasks will be
>     //   launched and the updateTaskStatuses() will take care of it
>     JobStatus newStatus = (JobStatus)job.getStatus().clone();
>     if (prevStatus.getRunState() != newStatus.getRunState()
>         && newStatus.getRunState() == JobStatus.KILLED) {
>       JobStatusChangeEvent event =
>         new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus, newStatus);
>       updateJobInProgressListeners(event);
>     }
>   }
>
> Then JobInProgress.kill():
>
>   /**
>    * Kill the job and all its component tasks. This method should be called from
>    * jobtracker and should return fast as it locks the jobtracker.
>    */
>   public void kill() {
>     boolean killNow = false;
>     synchronized (jobInitKillStatus) {
>       jobInitKillStatus.killed = true;
>       // if not in middle of init, terminate it now
>       if (!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
>         // avoiding nested locking by setting flag
>         killNow = true;
>       }
>     }
>     if (killNow) {
>       terminate(JobStatus.KILLED);
>     }
>   }
>
> I don't really see the issue at this point, so any further information /
> debugging on your end that reveals this bug would be very valuable to us.
>

I am trying to debug this on my laptop. What I see is that when I issue the kill command to the JobTracker, the job status never changes; it stays in *PREP*. Can you reproduce this on your machine by following the steps I mentioned in my previous mail?

> Lastly, what version of the code are you running? If you're running off
> trunk, when did you last update it?
>

I am running the code from trunk; it was last updated last week.
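To narrow this down on my side, I will add some temporary logging around the
init/kill flags in JobInProgress.kill(), to see whether the kill is being
deferred because the job is still considered "in init" and therefore never
reaches terminate(). This is only a debugging sketch based on the cdh3u3 code
you pasted above (I am reusing getJobID()/getStatus() from your snippet; the
exact fields may differ in other versions):

    // Temporary instrumentation in JobInProgress.kill(), for debugging only.
    public void kill() {
      boolean killNow = false;
      synchronized (jobInitKillStatus) {
        jobInitKillStatus.killed = true;
        // Log the init flags: if initStarted is true and initDone is false,
        // the kill is deferred instead of calling terminate() right here.
        LOG.info("kill(): job=" + getJobID()
            + " initStarted=" + jobInitKillStatus.initStarted
            + " initDone=" + jobInitKillStatus.initDone
            + " runState=" + getStatus().getRunState());
        // if not in middle of init, terminate it now
        if (!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
          // avoiding nested locking by setting flag
          killNow = true;
        }
      }
      LOG.info("kill(): killNow=" + killNow + " for job " + getJobID());
      if (killNow) {
        terminate(JobStatus.KILLED);
      }
    }

I will report back what these log lines show while the job is stuck in *PREP*.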
>
> Ben
>
> On Fri, Apr 12, 2013 at 1:32 AM, 王国栋 <[email protected]> wrote:
>
> > Hi Vinod,
> >
> > When I submit the job, the log of jobtracker is as follow
> > ----------------------------------
> > 13/04/12 16:22:37 INFO mapred.MesosScheduler: Added job job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.JobTracker: Job job_201304121621_0001 added successfully for user 'guodong' to queue 'default'
> > 13/04/12 16:22:37 INFO mapred.JobTracker: Initializing job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Initializing job_201304121621_0001
> > 13/04/12 16:22:37 INFO mapred.AuditLogger: USER=guodong IP=127.0.0.1 OPERATION=SUBMIT_JOB TARGET=job_201304121621_0001 RESULT=SUCCESS
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: jobToken generated and stored with users keys in /tmp/hadoop-guodong/mapred/system/job_201304121621_0001/jobToken
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Input size for job job_201304121621_0001 = 89502988. Number of splits = 3
> > 13/04/12 16:22:37 INFO net.NetworkTopology: Adding a new node: /default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000000 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000001 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000002 has split on node:/default-rack/localhost
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: job_201304121621_0001 LOCALITY_WAIT_FACTOR=1.0
> > 13/04/12 16:22:37 INFO mapred.JobInProgress: Job job_201304121621_0001 initialized successfully with 3 map tasks and 1 reduce tasks.
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: JobTracker Status
> >   Pending Map Tasks: 3
> >   Pending Reduce Tasks: 1
> >   Idle Map Slots: 0
> >   Idle Reduce Slots: 0
> >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> >   Needed Map Slots: 3
> >   Needed Reduce Slots: 1
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> >   cpus: offered 4.0 needed 1.800000011920929
> >   mem : offered 2731.0 needed 6400.0
> >   disk: offered 75120.0 needed 4096.0
> >   ports: at least 2 (sufficient)
> > [name: "cpus"
> > type: SCALAR
> > scalar {
> >   value: 4.0
> > }
> > , name: "mem"
> > type: SCALAR
> > scalar {
> >   value: 2731.0
> > }
> > , name: "ports"
> > type: RANGES
> > ranges {
> >   range {
> >     begin: 31000
> >     end: 32000
> >   }
> > }
> > , name: "disk"
> > type: SCALAR
> > scalar {
> >   value: 75120.0
> > }
> > ]
> > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: JobTracker Status
> >   Pending Map Tasks: 3
> >   Pending Reduce Tasks: 1
> >   Idle Map Slots: 0
> >   Idle Reduce Slots: 0
> >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> >   Needed Map Slots: 3
> >   Needed Reduce Slots: 1
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> >   cpus: offered 4.0 needed 1.800000011920929
> >   mem : offered 2731.0 needed 6400.0
> >   disk: offered 75120.0 needed 4096.0
> >   ports: at least 2 (sufficient)
> > [name: "cpus"
> > type: SCALAR
> > scalar {
> >   value: 4.0
> > }
> > , name: "mem"
> > type: SCALAR
> > scalar {
> >   value: 2731.0
> > }
> > , name: "ports"
> > type: RANGES
> > ranges {
> >   range {
> >     begin: 31000
> >     end: 32000
> >   }
> > }
> > , name: "disk"
> > type: SCALAR
> > scalar {
> >   value: 75120.0
> > }
> > ]
> > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
> >
> > ----------------------------------
> >
> > the hadoop client log is
> > --------------------------------------------
> > 13/04/12 16:22:36 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> > 13/04/12 16:22:36 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> > 13/04/12 16:22:37 INFO input.FileInputFormat: Total input paths to process : 1
> > 13/04/12 16:22:37 WARN snappy.LoadSnappy: Snappy native library is available
> > 13/04/12 16:22:37 INFO snappy.LoadSnappy: Snappy native library loaded
> > 13/04/12 16:22:37 INFO mapred.JobClient: Running job: job_201304121621_0001
> > 13/04/12 16:22:38 INFO mapred.JobClient: map 0% reduce 0%
> > --------------------------------------------
> >
> > Since the client is hung up, I use ctrl-c to stop the client. Then use job -status to check the job status.
> >
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> > 13/04/12 16:26:22 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >
> > Job: job_201304121621_0001
> > file: file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> > tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> > map() completion: 0.0
> > reduce() completion: 0.0
> > Counters: 0
> >
> > Then try to kill the job
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
> > 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> > Killed job job_201304121621_0001
> >
> > when I kill the job, I can see the log on jobtracker
> > *13/04/12 16:27:13 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining*
> > *13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001*
> > *13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'*
> >
> > After I kill the job, I can use job-status to check the job status, and it is still pending.
> >
> > guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
> > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
> > 13/04/12 16:31:09 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
> >
> > Job: job_201304121621_0001
> > file: file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
> > tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
> > map() completion: 0.0
> > reduce() completion: 0.0
> > Counters: 0
> >
> > I hope this is helpful !
> >
> > Looking forward to your new progress. And I will also try to go through hadoop code to check it again.
> >
> > Thanks.
> >
> > Guodong
> >
> > On Fri, Apr 12, 2013 at 3:42 PM, Vinod Kone <[email protected]> wrote:
> >
> > > Hi Guodong,
> > >
> > > This is likely a bug on our side.
> > >
> > > Could you paste the output of the client when you issue the kill? Also,
> > > the output of the jobtracker (during the submission and kill of the job)
> > > would also be helpful.
> > >
> > > Thanks,
> > >
> > > -- Vinod
> > >
> > > On Thu, Apr 11, 2013 at 10:43 PM, 王国栋 <[email protected]> wrote:
> > >
> > > > Hi Ben,
> > > >
> > > > I am sorry for my mistake about point 2. The jobtracker jetty server
> > > > works fine. Yesterday, the execution time for my test job is too short,
> > > > so it is finished before the jetty server can show the job status in
> > > > running list. Today, I try some big job, and the status is perfectly right.
> > > >
> > > > More information about point 1. I can reproduce this by the following step.
> > > > 1. set the needed resource for each slot(memory, cpu) in mapred-site.xml.
> > > >    Make sure each slot need a lot of resource.
> > > > 2. Given only one mesos slave whose resource is not enough for one mapper
> > > >    slot and one reduce slot.
> > > > 3. run the jobtracker, submit a job which need 1 mapper and 1 reducer.
> > > >
> > > > Then, I can find out that the job is pending due to not enough resource.
> > > > I can use ctrl-c to stop the hadoop client. But the job is still pending.
> > > >
> > > > I try to kill the job with "hadoop job -kill ", but I can not kill the
> > > > pending job.
> > > > Before kill the job, I can check the job status is
> > > > *Job Setup:* <http://localhost:50030/jobtasks.jsp?jobid=job_201304121059_0001&type=setup&pagenum=1&state=killed> Pending
> > > > *Job Cleanup:* Pending
> > > >
> > > > After I try to kill the job, the job status is
> > > > *Job Setup:* failed
> > > > *Job Cleanup:* Pending
> > > >
> > > > Then the job is hang up. And I can never stop it.
> > > >
> > > > Is it possible that mesos-scheduler miss some killed job event ?
> > > >
> > > > Guodong
> > > >
> > > > On Fri, Apr 12, 2013 at 2:59 AM, Benjamin Mahler <[email protected]> wrote:
> > > >
> > > > > 'Pending Map Tasks': The nubmer of pending map tasks in the Hadoop
> > > > > JobTracker.
> > > > > 'Pending Reduce Tasks': The nubmer of pending reduce tasks in the
> > > > > Hadoop JobTracker.
> > > > >
> > > > > Did you successfully kill the job? If so, did you allow some time for
> > > > > the JobTracker to detect that the job was killed.
> > > > >
> > > > > Our scheduler (MesosScheduler.java) simply introspects on the JobTracker
> > > > > state to determine the number of pending map/reduce tasks in the system.
> > > > > I would expect this to go to 0 sometime after you kill your job, if
> > > > > that's not the case, I'll need more information to figure out what's
> > > > > going on.
> > > > >
> > > > > Can you elaborate more on your point 2? It almost sounds like you're
> > > > > talking to a different JobTracker? Note that the MesosScheduler runs
> > > > > *inside* the JobTracker.
> > > > >
> > > > > Also, note that we recently committed a deadlock fix for the Hadoop
> > > > > patch: https://reviews.apache.org/r/10352/
> > > > >
> > > > > On Thu, Apr 11, 2013 at 2:31 AM, 王国栋 <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to run hadoop over mesos. And I am using the code in the
> > > > > > trunk. But I ran into some problems here. My hadoop version is cdh3u3.
> > > > > >
> > > > > > *1. when a job is pending because of no enough resources, I use ctrl-c
> > > > > > to stop the job client. But I can see the pending mapper and pending
> > > > > > reducer are still in the job tracker. Then I try to use "hadoop job
> > > > > > -kill jobid" to kill this job, but nothing happens in jobtracker,
> > > > > > mapper and reducer are still pending. The log in jobtracker is as
> > > > > > follow.*
> > > > > >
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: JobTracker Status
> > > > > >   Pending Map Tasks: 1
> > > > > >   Pending Reduce Tasks: 1
> > > > > >   Idle Map Slots: 0
> > > > > >   Idle Reduce Slots: 0
> > > > > >   Inactive Map Slots: 0 (launched but no hearbeat yet)
> > > > > >   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
> > > > > >   Needed Map Slots: 1
> > > > > >   Needed Reduce Slots: 1
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
> > > > > >   cpus: offered 4.0 needed 1.800000011920929
> > > > > >   mem : offered 2731.0 needed 6432.0
> > > > > >   disk: offered 70651.0 needed 4096.0
> > > > > >   ports: at least 2 (sufficient)
> > > > > > [name: "cpus"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 4.0
> > > > > > }
> > > > > > , name: "mem"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 2731.0
> > > > > > }
> > > > > > , name: "ports"
> > > > > > type: RANGES
> > > > > > ranges {
> > > > > >   range {
> > > > > >     begin: 31000
> > > > > >     end: 32000
> > > > > >   }
> > > > > > }
> > > > > > , name: "disk"
> > > > > > type: SCALAR
> > > > > > scalar {
> > > > > >   value: 70651.0
> > > > > > }
> > > > > > ]
> > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 1 map slots 1 reduce slots remaining
> > > > > >
> > > > > > *2. when we submit the job to the jobtracker, I can not find any running
> > > > > > job on jobtracker web interface (http://localhost:50030/jobtracker.jsp).
> > > > > > But when the job is finished, I can see the job info in retired job in
> > > > > > jobtracker.*
> > > > > >
> > > > > > Any ideas about this ? Thanks a lot.
> > > > > >
> > > > > > Guodong
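One more thing: while I reproduce the steps above (a single slave whose offer
is too small for even one map and one reduce slot), I am also watching the job
state from a small client program rather than the web UI, so I can see exactly
whether the state ever leaves PREP after the kill. This is just a rough sketch
using the old org.apache.hadoop.mapred client API; the class name and the
JobTracker address are assumptions for my local setup:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class WatchJobState {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Assumption: local single-node setup; adjust to your JobTracker address.
        conf.set("mapred.job.tracker", "localhost:9001");

        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName(args[0]));
        if (job == null) {
          System.err.println("Job not found: " + args[0]);
          return;
        }

        System.out.println("state before kill: " + job.getJobState());
        job.killJob();

        // Poll for a minute; PREP=4 and KILLED=5 in the old mapred JobStatus.
        for (int i = 0; i < 30; i++) {
          Thread.sleep(2000);
          System.out.println("state: " + job.getJobState()
              + " (PREP=" + JobStatus.PREP + ", KILLED=" + JobStatus.KILLED + ")");
        }
      }
    }

I will paste what this prints, together with the jobtracker log, the next time
I reproduce the issue.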
