Hi Josh,

"Jobs" and "tasks" mean different things in Hadoop. A job is composed of many tasks (in your case, 344 map tasks plus some number of reduce tasks). There are effectively two levels of scheduler at work here. At the highest level, a single job is dequeued off the front of the FIFO list of jobs, and its tasks then become eligible for execution. TaskTrackers then pull individual map tasks out of this job and execute as many in parallel as they're configured to. So your TaskTracker may actually be running five tasks in parallel -- five map splits -- but, as Ashish points out, it may look like it's only "retiring" one task at a time, even though it has four others in flight simultaneously.
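As a sketch, the per-TaskTracker parallelism described above is capped by a mapred-site.xml setting like the following (this uses the old pre-MRv2 property name; the value 5 matches Josh's configuration quoted later in the thread):

```xml
<!-- Sketch: upper bound on the number of map tasks one TaskTracker
     runs concurrently. Property name is from the 0.x/1.x-era
     (pre-MRv2) MapReduce configuration. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>
```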
But if you enqueue a second job, its tasks get in line behind all the tasks of the first job. So on the top-level status page, that second job will list as "pending" because the map slots are all occupied by the first job (which is most likely running multiple tasks in parallel). Only after all the mappers for the first job are done will mappers for the second job be considered.

If you click on the job details for the first (running) job, and then click on "Mappers," it will show you a status line for each running mapper. This will confirm whether you're actually running 1 or 5 mappers concurrently.

There are other top-level job schedulers (FairScheduler, CapacityScheduler) which reserve capacity for multiple jobs to execute at the same time, splitting the available task slots between them.

Also, in Hive, try:

hive> set mapred.reduce.tasks = 5;

before you run your SELECT query. While the number of map tasks is determined automatically from the size of the input, the number of reduce tasks is user-controlled. You may see better reduce performance and more parallelism if you manually set that figure to equal the number of reduce task slots available.

- Aaron

On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
> I only have one task tracker right now because I'm just setting up some
> testing. But that one machine only runs 1 mapper at a time. In the job
> tracker web interface I only ever see 1 job running at a time, and no jobs
> ever start simultaneously from what I can tell. Is the behavior of a single
> task tracker that it can spawn *only* 1 child JVM at a time to do maps for
> a single job? How do I get it to spawn 4-6 children for mapping jobs at
> once?
>
> Josh Ferguson
>
> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>
>> How many nodes do you have in your map/reduce cluster? It could just be
>> the case that the cluster does not have enough map slots, so all 344 maps
>> cannot be run simultaneously.
>> Suppose you had a 4-node cluster. Then by your configuration you would
>> have a total of 20 map slots. So you would see 20 mappers started off,
>> and then, as each mapper finished, another would move from pending to
>> started. This could give the illusion that mappers are running one at a
>> time, though at any time 20 are running concurrently.
>>
>> You could also potentially decrease the number of mappers being run by
>> setting mapred.min.split.size.
>>
>> Ashish
>>
>> ________________________________________
>> From: Josh Ferguson [[email protected]]
>> Sent: Tuesday, January 27, 2009 9:20 PM
>> To: [email protected]
>> Subject: Number of tasks
>>
>> OK, so I'm experimenting with the slow-running Hive query I was having
>> trouble with earlier. It was indeed only processing one map task at a
>> time, even though I *think* I told it to do more. Anyone who is good
>> with Hadoop, feel free to speak up here as well; this is my first foray
>> into setting up jobs for production. Here is the relevant configuration
>> used on the job tracker and task tracker machines:
>>
>> <property>
>>   <name>mapred.map.tasks</name>
>>   <value>7</value>
>>   <description>The default number of map tasks per job. Typically set
>>   to a prime several times greater than the number of available hosts.
>>   Ignored when mapred.job.tracker is "local".
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.parallel.copies</name>
>>   <value>20</value>
>>   <description>The default number of parallel transfers run by reduce
>>   during the copy (shuffle) phase.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>5</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>5</value>
>>   <description>The maximum number of reduce tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>> The query was: SELECT COUNT(DISTINCT(table.field)) FROM table;
>>
>> Anyone know why this might only be running one map task at a time?
>> It takes about 5 minutes to go through 344 of them at this rate.
>>
>> Josh Ferguson
>>
>
>
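Putting Aaron's Hive suggestion together with the query from the original message, the session would look something like this (a sketch; the table and field names are the placeholders from the thread, and the value 5 assumes it matches the cluster's reduce slot count):

```sql
-- Sketch of Aaron's suggestion: pin the reduce-task count for the
-- session before running the aggregate query.
hive> set mapred.reduce.tasks = 5;
hive> SELECT COUNT(DISTINCT(table.field)) FROM table;
```

Note this only affects the reduce side; the 344 map tasks are still determined by the input size and split settings such as mapred.min.split.size.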
