I am building the latest code in trunk. I will keep you updated; if the problem is still there, I will give you the exact steps to reproduce it and make sure you can debug it.
I also think this may be a bug in jobtracker. :) Thanks. Guodong On Mon, Apr 15, 2013 at 10:20 AM, Benjamin Mahler <[email protected] > wrote: > On April 9th we submitted a deadlock fix, please update to make sure you > have the fix: > https://reviews.apache.org/r/10352/ > > Can you provide the commands to reproduce from a mesos build directory? I'd > like to be able to reproduce this locally on my laptop, so the exact > commands I need to run from within my mesos build directory would be useful > :) > > The fact that the job gets stuck in the JobTracker as PREP makes me think > there's a bug in the JobTracker, I imagine people don't often run > JobTrackers without any TaskTrackers, which is not the case here. > > > On Sun, Apr 14, 2013 at 7:04 PM, 王国栋 <[email protected]> wrote: > > > Hi Ben, > > > > I put my ideas inline. Please check. > > > > On Mon, Apr 15, 2013 at 8:13 AM, Benjamin Mahler > > <[email protected]>wrote: > > > > > So I'm still a little confused here. From what you showed, it looks > like > > > the 'job -kill' command you posted succeeded on the client side? > > > > > > [trunk ?]$ ./bin/hadoop job -kill job_201304121621_000113/04/12 > 16:27:16 > > > INFO security.UserGroupInformation: JAAS Configuration already set up > for > > > Hadoop, not re-installing. > > > *Killed job job_201304121621_0001** > > > * > > > > > Yes, from the client side, killing the job is successful. > > > > > > > > > > I see that the JobTracker still thinks the job is pending. What happens > > > when you re-issue that kill command? > > > > > The Jobtracker still think the status of job is *PREP*, when I reissue > kill > > cmd, it seems the same as the first time I issued the cmd. > > > > > > > > > > I'm confused as to why it's still in pending when the JobTracker has > > > removed the job: > > > > > > 13/04/12 16:27:16 INFO mapred.JobTracker: Killing job > > job_201304121621_0001 > > > 13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job > > > 'job_201304121621_0001'* > > > > > > Looking at the JobTracker code, it seems like this indeed kills the > job: > > > > > > The code that prints the log line in JobTracker: > > > private synchronized void killJob(JobInProgress job) { > > > LOG.info("Killing job " + job.getJobID()); > > > JobStatus prevStatus = (JobStatus)job.getStatus().clone(); > > > job.kill(); > > > > > > // Inform the listeners if the job is killed > > > // Note : > > > // If the job is killed in the PREP state then the listeners will > > be > > > // invoked > > > // If the job is killed in the RUNNING state then cleanup tasks > > will > > > be > > > // launched and the updateTaskStatuses() will take care of it > > > JobStatus newStatus = (JobStatus)job.getStatus().clone(); > > > if (prevStatus.getRunState() != newStatus.getRunState() > > > && newStatus.getRunState() == JobStatus.KILLED) { > > > JobStatusChangeEvent event = > > > new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, > > > prevStatus, > > > newStatus); > > > updateJobInProgressListeners(event); > > > } > > > } > > > > > > Then JobInProgress.kill(): > > > /** > > > * Kill the job and all its component tasks. This method should be > > called > > > from > > > * jobtracker and should return fast as it locks the jobtracker. 
> > > */ > > > public void kill() { > > > boolean killNow = false; > > > synchronized(jobInitKillStatus) { > > > jobInitKillStatus.killed = true; > > > //if not in middle of init, terminate it now > > > if(!jobInitKillStatus.initStarted || jobInitKillStatus.initDone) > { > > > //avoiding nested locking by setting flag > > > killNow = true; > > > } > > > } > > > if(killNow) { > > > terminate(JobStatus.KILLED); > > > } > > > } > > > > > > I don't really see the issue at this point, so any further information > / > > > debugging on your end that reveals this bug would be very valuable to > us. > > > > > I am trying to debug this on my laptop. I found when I issue kill cmd to > > jobtracker, the job status is never changed. It is always *PREP*. > > Can you reproduce this on your machine when you follow the steps I > > mentioned in the previous mail? > > > > > > > > > > Lastly, what version of the code are you running? If you're running off > > > trunk, when did you last update it? > > > > > > > I am running with the code in the trunk, it is updated last week. > > > > > > > > Ben > > > > > > > > > On Fri, Apr 12, 2013 at 1:32 AM, 王国栋 <[email protected]> wrote: > > > > > > > Hi Vinod, > > > > > > > > When I submit the job, the log of jobtracker is as follow > > > > ---------------------------------- > > > > 13/04/12 16:22:37 INFO mapred.MesosScheduler: Added job > > > > job_201304121621_0001 > > > > 13/04/12 16:22:37 INFO mapred.JobTracker: Job job_201304121621_0001 > > added > > > > successfully for user 'guodong' to queue 'default' > > > > 13/04/12 16:22:37 INFO mapred.JobTracker: Initializing > > > > job_201304121621_0001 > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: Initializing > > > > job_201304121621_0001 > > > > 13/04/12 16:22:37 INFO mapred.AuditLogger: USER=guodong IP=127.0.0.1 > > > > OPERATION=SUBMIT_JOB TARGET=job_201304121621_0001 RESULT=SUCCESS > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: jobToken generated and > > > stored > > > > with users keys in > > > > /tmp/hadoop-guodong/mapred/system/job_201304121621_0001/jobToken > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: Input size for job > > > > job_201304121621_0001 = 89502988. Number of splits = 3 > > > > 13/04/12 16:22:37 INFO net.NetworkTopology: Adding a new node: > > > > /default-rack/localhost > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: > > > > tip:task_201304121621_0001_m_000000 has split on > > > > node:/default-rack/localhost > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: > > > > tip:task_201304121621_0001_m_000001 has split on > > > > node:/default-rack/localhost > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: > > > > tip:task_201304121621_0001_m_000002 has split on > > > > node:/default-rack/localhost > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: job_201304121621_0001 > > > > LOCALITY_WAIT_FACTOR=1.0 > > > > 13/04/12 16:22:37 INFO mapred.JobInProgress: Job > job_201304121621_0001 > > > > initialized successfully with 3 map tasks and 1 reduce tasks. 
> > > > 13/04/12 16:22:39 INFO mapred.MesosScheduler: JobTracker Status > > > > Pending Map Tasks: 3 > > > > Pending Reduce Tasks: 1 > > > > Idle Map Slots: 0 > > > > Idle Reduce Slots: 0 > > > > Inactive Map Slots: 0 (launched but no hearbeat yet) > > > > Inactive Reduce Slots: 0 (launched but no hearbeat yet) > > > > Needed Map Slots: 3 > > > > Needed Reduce Slots: 1 > > > > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Declining offer with > > > > insufficient resources for a TaskTracker: > > > > cpus: offered 4.0 needed 1.800000011920929 > > > > mem : offered 2731.0 needed 6400.0 > > > > disk: offered 75120.0 needed 4096.0 > > > > ports: at least 2 (sufficient) > > > > [name: "cpus" > > > > type: SCALAR > > > > scalar { > > > > value: 4.0 > > > > } > > > > , name: "mem" > > > > type: SCALAR > > > > scalar { > > > > value: 2731.0 > > > > } > > > > , name: "ports" > > > > type: RANGES > > > > ranges { > > > > range { > > > > begin: 31000 > > > > end: 32000 > > > > } > > > > } > > > > , name: "disk" > > > > type: SCALAR > > > > scalar { > > > > value: 75120.0 > > > > } > > > > ] > > > > 13/04/12 16:22:39 INFO mapred.MesosScheduler: Unable to fully satisfy > > > > needed map/reduce slots: 3 map slots 1 reduce slots remaining > > > > 13/04/12 16:22:45 INFO mapred.MesosScheduler: JobTracker Status > > > > Pending Map Tasks: 3 > > > > Pending Reduce Tasks: 1 > > > > Idle Map Slots: 0 > > > > Idle Reduce Slots: 0 > > > > Inactive Map Slots: 0 (launched but no hearbeat yet) > > > > Inactive Reduce Slots: 0 (launched but no hearbeat yet) > > > > Needed Map Slots: 3 > > > > Needed Reduce Slots: 1 > > > > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Declining offer with > > > > insufficient resources for a TaskTracker: > > > > cpus: offered 4.0 needed 1.800000011920929 > > > > mem : offered 2731.0 needed 6400.0 > > > > disk: offered 75120.0 needed 4096.0 > > > > ports: at least 2 (sufficient) > > > > [name: "cpus" > > > > type: SCALAR > > > > scalar { > > > > value: 4.0 > > > > } > > > > , name: "mem" > > > > type: SCALAR > > > > scalar { > > > > value: 2731.0 > > > > } > > > > , name: "ports" > > > > type: RANGES > > > > ranges { > > > > range { > > > > begin: 31000 > > > > end: 32000 > > > > } > > > > } > > > > , name: "disk" > > > > type: SCALAR > > > > scalar { > > > > value: 75120.0 > > > > } > > > > ] > > > > 13/04/12 16:22:45 INFO mapred.MesosScheduler: Unable to fully satisfy > > > > needed map/reduce slots: 3 map slots 1 reduce slots remaining > > > > > > > > ---------------------------------- > > > > > > > > the hadoop client log is > > > > -------------------------------------------- > > > > 13/04/12 16:22:36 INFO security.UserGroupInformation: JAAS > > Configuration > > > > already set up for Hadoop, not re-installing. > > > > 13/04/12 16:22:36 INFO util.NativeCodeLoader: Loaded the > native-hadoop > > > > library > > > > 13/04/12 16:22:37 INFO input.FileInputFormat: Total input paths to > > > process > > > > : 1 > > > > 13/04/12 16:22:37 WARN snappy.LoadSnappy: Snappy native library is > > > > available > > > > 13/04/12 16:22:37 INFO snappy.LoadSnappy: Snappy native library > loaded > > > > 13/04/12 16:22:37 INFO mapred.JobClient: Running job: > > > job_201304121621_0001 > > > > 13/04/12 16:22:38 INFO mapred.JobClient: map 0% reduce 0% > > > > > > > > -------------------------------------------- > > > > > > > > Since the client is hung up, I use ctrl-c to stop the client. Then > use > > > job > > > > -status to check the job status. 
> > > > > > > > guodong@guodong-Vostro-3400 > > > > :~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3 > > > > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001 > > > > 13/04/12 16:26:22 INFO security.UserGroupInformation: JAAS > > Configuration > > > > already set up for Hadoop, not re-installing. > > > > > > > > Job: job_201304121621_0001 > > > > file: > > > > > > > > > > > > > > file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml > > > > tracking URL: > > > > http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001 > > > > map() completion: 0.0 > > > > reduce() completion: 0.0 > > > > Counters: 0 > > > > > > > > Then try to kill the job > > > > guodong@guodong-Vostro-3400 > > > > :~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3 > > > > [trunk ?]$ ./bin/hadoop job -kill job_201304121621_000113/04/12 > > 16:27:16 > > > > INFO security.UserGroupInformation: JAAS Configuration already set up > > for > > > > Hadoop, not re-installing. > > > > Killed job job_201304121621_0001 > > > > > > > > when I kill the job, I can see the log on jobtracker > > > > *13/04/12 16:27:13 INFO mapred.MesosScheduler: Unable to fully > satisfy > > > > needed map/reduce slots: 3 map slots 1 reduce slots remaining* > > > > *13/04/12 16:27:16 INFO mapred.JobTracker: Killing job > > > > job_201304121621_0001 > > > > * > > > > *13/04/12 16:27:16 INFO mapred.JobInProgress: Killing job > > > > 'job_201304121621_0001'* > > > > > > > > After I kill the job, I can use job-status to check the job status, > and > > > it > > > > is still pending. > > > > > > > > guodong@guodong-Vostro-3400 > > > > :~/workspace/mesos-trunk/build/hadoop/hadoop-0.20.2-cdh3u3 > > > > [trunk ?]$ ./bin/hadoop job -status job_201304121621_0001 > > > > 13/04/12 16:31:09 INFO security.UserGroupInformation: JAAS > > Configuration > > > > already set up for Hadoop, not re-installing. > > > > > > > > Job: job_201304121621_0001 > > > > file: > > > > > > > > > > > > > > file:/tmp/hadoop-guodong/mapred/staging/guodong/.staging/job_201304121621_0001/job.xml > > > > tracking URL: > > > > http://localhost:50030/jobdetails.jsp?jobid=job_201304121621_0001 > > > > map() completion: 0.0 > > > > reduce() completion: 0.0 > > > > Counters: 0 > > > > > > > > > > > > I hope this is helpful ! > > > > > > > > Looking forward to your new progress. And I will also try to go > through > > > > hadoop code to check it again. > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > Guodong > > > > > > > > > > > > On Fri, Apr 12, 2013 at 3:42 PM, Vinod Kone <[email protected]> > > wrote: > > > > > > > > > Hi Guodong, > > > > > > > > > > This is likely a bug on our side. > > > > > > > > > > Could you paste the output of the client when you issue the kill? > > Also, > > > > the > > > > > output of the jobtracker (during the submission and kill of the > job) > > > > would > > > > > also be helpful. > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > -- Vinod > > > > > > > > > > > > > > > On Thu, Apr 11, 2013 at 10:43 PM, 王国栋 <[email protected]> wrote: > > > > > > > > > > > Hi Ben, > > > > > > > > > > > > I am sorry for my mistake about point 2. The jobtracker jetty > > server > > > > > works > > > > > > fine. Yesterday, the execution time for my test job is too > short, > > so > > > > it > > > > > is > > > > > > finished before the jetty server can show the job status in > > running > > > > > list. > > > > > > Today, I try some big job, and the status is perfectly right. 
> > > > > > > > > > > > More information about point 1. I can reproduce this by the > > following > > > > > step. > > > > > > 1. set the needed resource for each slot(memory, cpu) in > > > > mapred-site.xml. > > > > > > Make sure each slot need a lot of resource. > > > > > > 2. Given only one mesos slave whose resource is not enough for > one > > > > mapper > > > > > > slot and one reduce slot. > > > > > > 3. run the jobtracker, submit a job which need 1 mapper and 1 > > > reducer. > > > > > > > > > > > > Then , I can find out that the job is pending due to not enough > > > > > resource. I > > > > > > can use ctrl-c to stop the hadoop client. But the job is still > > > pending. > > > > > > > > > > > > I try to kill the job with "hadoop job -kill ", but I can not > kill > > > the > > > > > > pending job. > > > > > > Before kill the job, I can check the job status is > > > > > > *Job Setup:* < > > > > > > > > > > > > > > > > > > > > > http://localhost:50030/jobtasks.jsp?jobid=job_201304121059_0001&type=setup&pagenum=1&state=killed > > > > > > > > > > > > > Pending > > > > > > *Job Cleanup:* Pending* > > > > > > * > > > > > > * > > > > > > * > > > > > > After I try to kill the job, the job status is > > > > > > *Job Setup: *failed > > > > > > *Job Cleanup:* Pending > > > > > > > > > > > > Then the job is hang up. And I can never stop it. > > > > > > > > > > > > Is it possible that mesos-scheduler miss some killed job event ? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Guodong > > > > > > > > > > > > > > > > > > On Fri, Apr 12, 2013 at 2:59 AM, Benjamin Mahler > > > > > > <[email protected]>wrote: > > > > > > > > > > > > > 'Pending Map Tasks': The nubmer of pending map tasks in the > > Hadoop > > > > > > > JobTracker. > > > > > > > 'Pending Reduce Tasks': The nubmer of pending reduce tasks in > the > > > > > Hadoop > > > > > > > JobTracker. > > > > > > > > > > > > > > Did you successfully kill the job? If so, did you allow some > time > > > for > > > > > the > > > > > > > JobTracker to detect that the job was killed. > > > > > > > > > > > > > > Our scheduler (MesosScheduler.java) simply introspects on the > > > > > JobTracker > > > > > > > state to determine the number of pending map/reduce tasks in > the > > > > > system. > > > > > > I > > > > > > > would expect this to go to 0 sometime after you kill your job, > if > > > > > that's > > > > > > > not the case, I'll need more information to figure out what's > > going > > > > on. > > > > > > > > > > > > > > Can you elaborate more on your point 2? It almost sounds like > > > you're > > > > > > > talking to a different JobTracker? Note that the MesosScheduler > > > runs > > > > > > > *inside* the JobTracker. > > > > > > > > > > > > > > Also, note that we recently committed a deadlock fix for the > > Hadoop > > > > > > patch: > > > > > > > https://reviews.apache.org/r/10352/ > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 11, 2013 at 2:31 AM, 王国栋 <[email protected]> > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I am trying to run hadoop over mesos. And I am using the code > > in > > > > the > > > > > > > trunk. > > > > > > > > But I ran into some problems here. My hadoop version is > cdh3u3. > > > > > > > > > > > > > > > > *1. when a job is pending because of no enough resources, I > use > > > > > ctrl-c > > > > > > to > > > > > > > > stop the job client. But I can see the pending mapper and > > pending > > > > > > reducer > > > > > > > > are still in the job tracker. 
Then I try to use "hadoop job > > -kill > > > > > > jobid" > > > > > > > to > > > > > > > > kill this job, but nothing happens in jobtracker, mapper and > > > > reducer > > > > > > are > > > > > > > > still pending. The log in jobtracker is as follow.* > > > > > > > > > > > > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: JobTracker > Status > > > > > > > > Pending Map Tasks: 1 > > > > > > > > Pending Reduce Tasks: 1 > > > > > > > > Idle Map Slots: 0 > > > > > > > > Idle Reduce Slots: 0 > > > > > > > > Inactive Map Slots: 0 (launched but no hearbeat yet) > > > > > > > > Inactive Reduce Slots: 0 (launched but no hearbeat yet) > > > > > > > > Needed Map Slots: 1 > > > > > > > > Needed Reduce Slots: 1 > > > > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Declining offer > > > with > > > > > > > > insufficient resources for a TaskTracker: > > > > > > > > cpus: offered 4.0 needed 1.800000011920929 > > > > > > > > mem : offered 2731.0 needed 6432.0 > > > > > > > > disk: offered 70651.0 needed 4096.0 > > > > > > > > ports: at least 2 (sufficient) > > > > > > > > [name: "cpus" > > > > > > > > type: SCALAR > > > > > > > > scalar { > > > > > > > > value: 4.0 > > > > > > > > } > > > > > > > > , name: "mem" > > > > > > > > type: SCALAR > > > > > > > > scalar { > > > > > > > > value: 2731.0 > > > > > > > > } > > > > > > > > , name: "ports" > > > > > > > > type: RANGES > > > > > > > > ranges { > > > > > > > > range { > > > > > > > > begin: 31000 > > > > > > > > end: 32000 > > > > > > > > } > > > > > > > > } > > > > > > > > , name: "disk" > > > > > > > > type: SCALAR > > > > > > > > scalar { > > > > > > > > value: 70651.0 > > > > > > > > } > > > > > > > > ] > > > > > > > > 13/04/11 17:21:07 INFO mapred.MesosScheduler: Unable to fully > > > > satisfy > > > > > > > > needed map/reduce slots: 1 map slots 1 reduce slots remaining > > > > > > > > > > > > > > > > *2. when we submit the job to the jobtracker, I can not find > > any > > > > > > running > > > > > > > > job on jobtracker web interface( > > > > > http://localhost:50030/jobtracker.jsp > > > > > > ). > > > > > > > > But > > > > > > > > when the job is finished, I can see the job info in retired > job > > > in > > > > > > > > jobtracker.* > > > > > > > > > > > > > > > > Any ideas about this ? Thanks a lot. > > > > > > > > > > > > > > > > Guodong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
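A small client-side probe may help narrow this down: after issuing the kill, poll the job's run state and see whether it ever leaves PREP and reaches KILLED, which is what the killJob()/kill() path quoted above is supposed to produce. This is only a rough sketch, assuming the default JobConf resolves to the JobTracker in question; the job id from the logs is used purely as an example.

----------------------------------
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class KillStateProbe {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName("job_201304121621_0001"));
    if (job == null) {
      System.out.println("job not found on the JobTracker");
      return;
    }
    job.killJob();  // same request that "hadoop job -kill" sends
    for (int i = 0; i < 30; i++) {  // poll for up to ~30 seconds
      int state = job.getJobState();  // JobStatus.PREP, RUNNING, KILLED, ...
      System.out.println("run state after kill: " + state);
      if (state == JobStatus.KILLED || state == JobStatus.FAILED) {
        return;  // the kill took effect on the JobTracker side
      }
      Thread.sleep(1000);
    }
    System.out.println("job is still not KILLED; it appears stuck in PREP");
  }
}
----------------------------------

If the reported state never changes, the bug is on the JobTracker/JobInProgress side rather than in the Mesos scheduler, which only mirrors the JobTracker's pending-task counts. Given the "Job Cleanup: Pending" status above, one plausible suspect is that the job-cleanup task can never be scheduled because no TaskTracker exists, so the PREP job never transitions to KILLED.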

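For completeness, the "Declining offer with insufficient resources" lines above come down to comparing each offered resource against what a single TaskTracker needs, and the offers here fail only on memory (2731 MB offered vs 6400/6432 MB needed). Below is a minimal standalone sketch of that comparison; the class and constant names are illustrative only and are not the actual MesosScheduler.java code.

----------------------------------
public final class OfferCheck {
  // Per-TaskTracker requirements, taken from the jobtracker log above
  // (slot requirements plus TaskTracker overhead).
  static final double NEEDED_CPUS  = 1.800000011920929;
  static final double NEEDED_MEM   = 6400.0;  // MB
  static final double NEEDED_DISK  = 4096.0;  // MB
  static final int    NEEDED_PORTS = 2;

  static boolean sufficient(double cpus, double mem, double disk, int ports) {
    return cpus >= NEEDED_CPUS
        && mem  >= NEEDED_MEM
        && disk >= NEEDED_DISK
        && ports >= NEEDED_PORTS;
  }

  public static void main(String[] args) {
    // The offer from the log: 4 cpus, 2731 MB mem, 75120 MB disk,
    // ports 31000-32000 (1001 ports).
    boolean launch = sufficient(4.0, 2731.0, 75120.0, 1001);
    System.out.println(launch ? "launch TaskTracker" : "decline offer");  // prints "decline offer"
  }
}
----------------------------------

Since no offer ever satisfies the memory requirement, no TaskTracker is launched, the needed map/reduce slots are never filled, and the job sits in PREP, which is exactly the state the kill then has to handle.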