Thanks for digging through the code. Your analysis makes sense. Though it is strange that Hadoop wants to run a cleanup task for a job that was never launched. Maybe you could file a bug for Hadoop?
@vinodkone

Sent from my mobile

On Apr 17, 2013, at 2:30 AM, 王国栋 <[email protected]> wrote:

> Hi Ben,
>
> I have read some of the job tracker and task tracker code over the last two days.
>
> I don't think this scenario is a bug, neither in the Mesos scheduler nor in the jobtracker. Here is my understanding.
>
> We kill the job while its status is PREP (which means the setup task has not been launched yet). When we kill the job, the jobtracker just marks two boolean flags in JobInProgress as true, meaning the job is killed and failed. But killing a job will not trigger any job status change until the job cleanup task is scheduled to a tasktracker.
>
> The job status stays in PREP simply because Mesos cannot launch a tasktracker due to insufficient resources, which is a little embarrassing. When there is enough resource for Mesos to launch a tasktracker (say, some other mesos-slave offers enough to start one), Mesos will launch a task tracker and the job cleanup task will be scheduled to it. Finally, after the cleanup task completes, the job status will be updated to *KILLED* and the pending map and reduce slots will be released too.
>
> If anything is wrong, please correct me. :)
>
> Guodong
>
> On Tue, Apr 16, 2013 at 1:02 AM, Benjamin Mahler <[email protected]> wrote:
>
>> Thanks! It would be helpful to me if you could provide the exact commands you're running for steps 8, 9 and 10 as well. It would save me some time, as my knowledge of Hadoop commands is limited.
>>
>> On Sun, Apr 14, 2013 at 9:53 PM, 王国栋 <[email protected]> wrote:
>>
>>> Hi Ben,
>>>
>>> I've updated to the latest code in trunk, and the problem is still there.
>>>
>>> Please follow these steps to reproduce it:
>>> 1. check out the trunk code
>>> 2. bootstrap
>>> 3. mkdir build (for an out-of-source build)
>>> 4. cd build && ../configure
>>> 5. make
>>> 6. cd hadoop && make hadoop-0.20.2-cdh3u3
>>> 7. modify conf/mapred-site.xml and set mapred.mesos.slot.mem=10240 (to make sure the resources are not enough)
>>> 8. start mesos and the jobtracker
>>> 9. submit a wordcount job to the jobtracker (at this point the job is pending due to insufficient resources)
>>> 10. kill the job (the job cannot be killed)
>>>
>>> Guodong
>>>
>>> On Mon, Apr 15, 2013 at 10:52 AM, 王国栋 <[email protected]> wrote:
>>>
>>>> I am building the latest code in the trunk. I will keep you updated; if the problem is still there, I will give you the exact reproduction steps and make sure you can debug it.
>>>>
>>>> I also think this may be a bug in the jobtracker. :)
>>>>
>>>> Thanks.
>>>>
>>>> Guodong
>>>>
>>>> On Mon, Apr 15, 2013 at 10:20 AM, Benjamin Mahler <[email protected]> wrote:
>>>>
>>>>> On April 9th we submitted a deadlock fix, please update to make sure you have the fix: https://reviews.apache.org/r/10352/
>>>>>
>>>>> Can you provide the commands to reproduce from a mesos build directory? I'd like to be able to reproduce this locally on my laptop, so the exact commands I need to run from within my mesos build directory would be useful :)
>>>>>
>>>>> The fact that the job gets stuck in the JobTracker as PREP makes me think there's a bug in the JobTracker; I imagine people don't often run a JobTracker without any TaskTrackers, which is the situation here.
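Ben asks twice for the exact commands behind steps 8, 9 and 10 of Guodong's list above, and the thread never spells them out. The lines below are only a rough sketch of what those steps typically look like from a Mesos build directory; the mesos-master.sh/mesos-slave.sh wrapper scripts, the input/output paths, the examples jar name and the job id are assumptions based on a standard trunk build and the cdh3u3 tarball, not commands taken from this thread.

    # Step 7, for context: in conf/mapred-site.xml (inside <configuration>) set
    #   <property><name>mapred.mesos.slot.mem</name><value>10240</value></property>
    # so that no resource offer is large enough for a TaskTracker.

    # Step 8: start a Mesos master and slave, then the JobTracker.
    ./bin/mesos-master.sh --ip=127.0.0.1 &
    ./bin/mesos-slave.sh --master=127.0.0.1:5050 &
    cd hadoop/hadoop-0.20.2-cdh3u3
    ./bin/hadoop jobtracker &

    # Step 9: submit a wordcount job; it stays pending because no TaskTracker can start.
    ./bin/hadoop jar hadoop-examples-0.20.2-cdh3u3.jar wordcount /tmp/input /tmp/output

    # Step 10: kill the job from another terminal (use the job id printed by the client).
    # The client reports success, but the JobTracker keeps the job in PREP.
    ./bin/hadoop job -kill job_201304121621_0001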
>>>>> On Sun, Apr 14, 2013 at 7:04 PM, 王国栋 <[email protected]> wrote:
>>>>>
>>>>>> Hi Ben,
>>>>>>
>>>>>> I put my ideas inline. Please check.
>>>>>>
>>>>>> On Mon, Apr 15, 2013 at 8:13 AM, Benjamin Mahler <[email protected]> wrote:
>>>>>>
>>>>>>> So I'm still a little confused here. From what you showed, it looks like the 'job -kill' command you posted succeeded on the client side?
>>>>>>>
>>>>>>> [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
>>>>>>> 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>> *Killed job job_201304121621_0001*
>>>>>>
>>>>>> Yes, from the client side, killing the job is successful.
>>>>>>
>>>>>>> I see that the JobTracker still thinks the job is pending. What happens when you re-issue that kill command?
>>>>>>
>>>>>> The jobtracker still thinks the status of the job is *PREP*; when I reissue the kill cmd, the result looks the same as the first time I issued it.
>>>>>>
>>>>>>> I'm confused as to why it's still pending when the JobTracker has removed the job:
>>>>>>>
>>>>>>> 13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001
>>>>>>> 13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job 'job_201304121621_0001'
>>>>>>>
>>>>>>> Looking at the JobTracker code, it seems like this indeed kills the job.
>>>>>>>
>>>>>>> The code that prints the log line in JobTracker:
>>>>>>>
>>>>>>>   private synchronized void killJob(JobInProgress job) {
>>>>>>>     LOG.info("Killing job " + job.getJobID());
>>>>>>>     JobStatus prevStatus = (JobStatus)job.getStatus().clone();
>>>>>>>     job.kill();
>>>>>>>
>>>>>>>     // Inform the listeners if the job is killed
>>>>>>>     // Note :
>>>>>>>     //   If the job is killed in the PREP state then the listeners will be
>>>>>>>     //   invoked
>>>>>>>     //   If the job is killed in the RUNNING state then cleanup tasks will be
>>>>>>>     //   launched and the updateTaskStatuses() will take care of it
>>>>>>>     JobStatus newStatus = (JobStatus)job.getStatus().clone();
>>>>>>>     if (prevStatus.getRunState() != newStatus.getRunState()
>>>>>>>         && newStatus.getRunState() == JobStatus.KILLED) {
>>>>>>>       JobStatusChangeEvent event =
>>>>>>>         new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus,
>>>>>>>                                  newStatus);
>>>>>>>       updateJobInProgressListeners(event);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> Then JobInProgress.kill():
>>>>>>>
>>>>>>>   /**
>>>>>>>    * Kill the job and all its component tasks. This method should be called from
>>>>>>>    * jobtracker and should return fast as it locks the jobtracker.
>>>>>>>    */
>>>>>>>   public void kill() {
>>>>>>>     boolean killNow = false;
>>>>>>>     synchronized(jobInitKillStatus) {
>>>>>>>       jobInitKillStatus.killed = true;
>>>>>>>       //if not in middle of init, terminate it now
>>>>>>>       if(!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) {
>>>>>>>         //avoiding nested locking by setting flag
>>>>>>>         killNow = true;
>>>>>>>       }
>>>>>>>     }
>>>>>>>     if(killNow) {
>>>>>>>       terminate(JobStatus.KILLED);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> I don't really see the issue at this point, so any further information / debugging on your end that reveals this bug would be very valuable to us.
>>>>>>
>>>>>> I am trying to debug this on my laptop. I found that when I issue the kill cmd to the jobtracker, the job status never changes. It is always *PREP*.
>>>>>> Can you reproduce this on your machine when you follow the steps I mentioned in the previous mail?
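To observe the stuck state Guodong describes, one can poll the JobTracker from the command line while no TaskTracker is running. This is an illustrative check only, not something taken from the thread, and it assumes that in this Hadoop version 'hadoop job -list' prints a numeric State column in which 4 corresponds to JobStatus.PREP.

    # Illustrative only: watch the job while no TaskTracker has registered.
    # The State column from 'job -list' should stay at 4 (PREP) even after the kill.
    while true; do
      ./bin/hadoop job -list | grep job_201304121621_0001
      ./bin/hadoop job -status job_201304121621_0001 | grep completion
      sleep 10
    done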
>>>>>>> Lastly, what version of the code are you running? If you're running off trunk, when did you last update it?
>>>>>>
>>>>>> I am running the code in the trunk; it was last updated last week.
>>>>>>
>>>>>>> Ben
>>>>>>>
>>>>>>> On Fri, Apr 12, 2013 at 1:32 AM, 王国栋 <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vinod,
>>>>>>>>
>>>>>>>> When I submit the job, the log of the jobtracker is as follows:
>>>>>>>> ----------------------------------
>>>>>>>> 13/04/12 16:22:37 INFO mapred.MesosScheduler: Added job job_201304121621_0001
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobTracker: Job job_201304121621_0001 added successfully for user 'guodong' to queue 'default'
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobTracker: Initializing job_201304121621_0001
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: Initializing job_201304121621_0001
>>>>>>>> 13/04/12 16:22:37 INFO mapred.AuditLogger: USER=guodong IP=127.0.0.1 OPERATION=SUBMIT_JOB TARGET=job_201304121621_0001 RESULT=SUCCESS
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: jobToken generated and stored with users keys in /tmp/hadoop-guodong/mapred/system/job_201304121621_0001/jobToken
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: Input size for job job_201304121621_0001 = 89502988. Number of splits = 3
>>>>>>>> 13/04/12 16:22:37 INFO net.NetworkTopology: Adding a new node: /default-rack/localhost
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000000 has split on node:/default-rack/localhost
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000001 has split on node:/default-rack/localhost
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: tip:task_201304121621_0001_m_000002 has split on node:/default-rack/localhost
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: job_201304121621_0001 LOCALITY_WAIT_FACTOR=1.0
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobInProgress: Job job_201304121621_0001 initialized successfully with 3 map tasks and 1 reduce tasks.
>>>>>>>> 13/04/12 16:22:39 INFO mapred.MesosScheduler: JobTracker Status
>>>>>>>>       Pending Map Tasks: 3
>>>>>>>>    Pending Reduce Tasks: 1
>>>>>>>>          Idle Map Slots: 0
>>>>>>>>       Idle Reduce Slots: 0
>>>>>>>>      Inactive Map Slots: 0 (launched but no hearbeat yet)
>>>>>>>>   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
>>>>>>>>        Needed Map Slots: 3
>>>>>>>>     Needed Reduce Slots: 1
>>>>>>>> 13/04/12 16:22:39 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
>>>>>>>>   cpus: offered 4.0 needed 1.800000011920929
>>>>>>>>   mem : offered 2731.0 needed 6400.0
>>>>>>>>   disk: offered 75120.0 needed 4096.0
>>>>>>>>   ports: at least 2 (sufficient)
>>>>>>>> [name: "cpus"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 4.0 }
>>>>>>>> , name: "mem"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 2731.0 }
>>>>>>>> , name: "ports"
>>>>>>>>  type: RANGES
>>>>>>>>  ranges { range { begin: 31000 end: 32000 } }
>>>>>>>> , name: "disk"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 75120.0 }
>>>>>>>> ]
>>>>>>>> 13/04/12 16:22:39 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
>>>>>>>> 13/04/12 16:22:45 INFO mapred.MesosScheduler: JobTracker Status
>>>>>>>>       Pending Map Tasks: 3
>>>>>>>>    Pending Reduce Tasks: 1
>>>>>>>>          Idle Map Slots: 0
>>>>>>>>       Idle Reduce Slots: 0
>>>>>>>>      Inactive Map Slots: 0 (launched but no hearbeat yet)
>>>>>>>>   Inactive Reduce Slots: 0 (launched but no hearbeat yet)
>>>>>>>>        Needed Map Slots: 3
>>>>>>>>     Needed Reduce Slots: 1
>>>>>>>> 13/04/12 16:22:45 INFO mapred.MesosScheduler: Declining offer with insufficient resources for a TaskTracker:
>>>>>>>>   cpus: offered 4.0 needed 1.800000011920929
>>>>>>>>   mem : offered 2731.0 needed 6400.0
>>>>>>>>   disk: offered 75120.0 needed 4096.0
>>>>>>>>   ports: at least 2 (sufficient)
>>>>>>>> [name: "cpus"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 4.0 }
>>>>>>>> , name: "mem"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 2731.0 }
>>>>>>>> , name: "ports"
>>>>>>>>  type: RANGES
>>>>>>>>  ranges { range { begin: 31000 end: 32000 } }
>>>>>>>> , name: "disk"
>>>>>>>>  type: SCALAR
>>>>>>>>  scalar { value: 75120.0 }
>>>>>>>> ]
>>>>>>>> 13/04/12 16:22:45 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining
>>>>>>>> ----------------------------------
>>>>>>>>
>>>>>>>> The hadoop client log is:
>>>>>>>> --------------------------------------------
>>>>>>>> 13/04/12 16:22:36 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>>> 13/04/12 16:22:36 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>>>>> 13/04/12 16:22:37 INFO input.FileInputFormat: Total input paths to process : 1
>>>>>>>> 13/04/12 16:22:37 WARN snappy.LoadSnappy: Snappy native library is available
>>>>>>>> 13/04/12 16:22:37 INFO snappy.LoadSnappy: Snappy native library loaded
>>>>>>>> 13/04/12 16:22:37 INFO mapred.JobClient: Running job: job_201304121621_0001
>>>>>>>> 13/04/12 16:22:38 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>>>> --------------------------------------------
>>>>>>>>
>>>>>>>> Since the client hangs, I use ctrl-c to stop the client, then use job -status to check the job status:
>>>>>>>>
>>>>>>>> guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
>>>>>>>> [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001
>>>>>>>> 13/04/12 16:26:22 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>>>
>>>>>>>> Job: job_201304121621_0001
>>>>>>>> file: file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml
>>>>>>>> tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001
>>>>>>>> map() completion: 0.0
>>>>>>>> reduce() completion: 0.0
>>>>>>>> Counters: 0
>>>>>>>>
>>>>>>>> Then I try to kill the job:
>>>>>>>>
>>>>>>>> guodong@guodong-Vostro-3400:~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3
>>>>>>>> [trunk ?]$ ./bin/hadoop job -kill job_201304121621_0001
>>>>>>>> 13/04/12 16:27:16 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>>>>>>>> Killed job job_201304121621_0001
>>>>>>>>
>>>>>>>> When I kill the job, I can see this log on the jobtracker:
>>>>>>>> *13/04/12 16:27:13 INFO mapred.MesosScheduler: Unable to fully satisfy needed map/reduce slots: 3 map slots 1 reduce slots remaining*
>>>>>>>> *13/04/12 16:27:16 INFO mapred.JobTracker: Killing job job_201304121621_0001*
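Guodong's explanation at the top of the thread suggests a way to confirm the diagnosis: give Mesos an offer large enough for a TaskTracker, let the pending job cleanup task run, and the job should finally move from PREP to KILLED. The sketch below is hypothetical and not taken from the thread; the --resources values, and the assumption that State 5 in 'job -list all' output corresponds to KILLED, are guesses about this setup.

    # Hypothetical follow-up: restart the slave with an explicitly larger resource
    # declaration so one TaskTracker (and the job cleanup task) can be scheduled.
    ./bin/mesos-slave.sh --master=127.0.0.1:5050 \
        --resources="cpus:8;mem:49152;disk:16384;ports:[31000-32000]" &
    sleep 30
    # Expect the job to move from state 4 (PREP) to 5 (KILLED) once the cleanup task completes.
    ./bin/hadoop job -list all | grep job_201304121621_0001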
