I also just wanted to add, here are the counters for the one reduce task:

*File Systems*
  S3 bytes written        124
  Local bytes read        5,667,846
  Local bytes written     5,667,846

*Map-Reduce Framework*
  Reduce input groups     1
  Combine output records  0
  Reduce output records   0
  Combine input records   0
  Reduce input records    104,960
100.00%  reduce > reduce  28-Jan-2009 01:40:44  28-Jan-2009 01:46:32  (5mins, 47sec)  8

Seems like a long time for such a small set of data (this set is even
more pared down than the previous set I was using).

Josh Ferguson

On Wed, Jan 28, 2009 at 3:09 PM, Josh Ferguson <[email protected]> wrote:

> So I get all of that. I think what is happening is that I'm mapping a
> lot of files that are also small in total size (because I have them
> partitioned), so it is automatically choosing to execute only one
> mapper even though my machine has a total of 5 available. This is
> doubling my total map time due to task setup overhead. There is
> probably nothing I can do to get around this other than to try not to
> store so many small files.
>
> As far as the reduce speed, I'm not really sure what is causing it to
> go so slow. I'm reducing (via Hive's count function) 16 MB of files,
> which is about 144,000 records; it wouldn't take more than 5 seconds
> anywhere else, and it's taking over 18 minutes for my task tracker to
> do it. I just have a feeling that I'm doing something terribly wrong
> to get performance like that.
>
> Does anyone else have benchmarks for doing a count on some rows with
> one or two task trackers?
>
> Josh Ferguson
>
> On Wed, Jan 28, 2009 at 11:31 AM, Aaron Kimball <[email protected]> wrote:
>
>> Hi Josh,
>>
>> "Jobs" and "Tasks" mean different things in Hadoop. A job is
>> comprised of many tasks (in your case, 344 map tasks + some number
>> of reduce tasks). There are effectively two levels of schedulers at
>> work here. At the highest level, a single job is dequeued off the
>> front of the FIFO list of jobs, and its tasks become eligible for
>> execution. TaskTrackers then dequeue individual map tasks out of
>> this job and execute as many tasks in parallel as they're configured
>> to. So your TaskTracker may actually be running five tasks in
>> parallel -- five map splits -- but as Ashish points out, it may look
>> like it's only "retiring" one task at a time, even though it's got
>> four others in flight simultaneously.
>>
>> But if you enqueue a second job, it gets in line behind all the
>> tasks associated with the first job. So on the top-level status
>> page, that 2nd job will list as "pending" because the map slots are
>> all occupied by the first job (which is most likely running multiple
>> tasks in parallel). Only after all the mappers for the first job are
>> done will mappers for the second job be considered. If you click on
>> job details for the first (running) job, and then click on
>> "Mappers," it will show you a status line for each mapper running.
>> This can confirm whether you're actually running 1 or 5 mappers
>> concurrently.
>>
>> There are other top-level job schedulers (FairScheduler,
>> CapacityScheduler) which reserve capacity for multiple jobs to
>> execute at the same time, splitting the available task slots between
>> them.
>>
>> Also, in Hive, try:
>>
>> hive> set mapred.reduce.tasks = 5;
>>
>> before you run your SELECT query. While the number of map tasks is
>> auto-determined based on the size of the input, the number of reduce
>> tasks is user-controlled. You may see better reduce performance and
>> more parallelism if you manually set that figure to equal the number
>> of reduce task slots available.
>>
>> - Aaron
>>
>> On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
>>
>>> I only have one task tracker right now because I'm just setting up
>>> some testing. But that one machine only runs 1 mapper at a time. In
>>> the job tracker web interface I only ever see 1 job running at a
>>> time, and no jobs ever start simultaneously from what I can tell.
>>> Is it the behavior of a single task tracker that it can spawn
>>> *only* 1 child JVM at a time to do maps for a single job? How do I
>>> get it to spawn 4-6 children for mapping jobs at once?
>>>
>>> Josh Ferguson
>>>
>>> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>>>
>>>> How many nodes do you have in your map/reduce cluster? It could
>>>> just be the case that the cluster does not have enough map slots,
>>>> so all 344 maps cannot be run simultaneously. Suppose you had a
>>>> 4-node cluster. Then, by your configuration, you would have a
>>>> total of 20 map slots. So you would see 20 mappers start off and
>>>> then, as each mapper finishes, another would move from pending to
>>>> started. This could give the illusion that mappers are running one
>>>> at a time, though at any time 20 are running concurrently.
>>>>
>>>> Also, you could potentially decrease the number of mappers being
>>>> run by setting mapred.min.split.size.
>>>>
>>>> Ashish
>>>>
>>>> ________________________________________
>>>> From: Josh Ferguson [[email protected]]
>>>> Sent: Tuesday, January 27, 2009 9:20 PM
>>>> To: [email protected]
>>>> Subject: Number of tasks
>>>>
>>>> Ok, so I'm experimenting with the slow-running Hive query I was
>>>> having trouble with earlier. It was indeed only processing one map
>>>> task at a time, even though I *think* I told it to do more. Anyone
>>>> who is good with Hadoop, feel free to speak up here as well; this
>>>> is my first foray into setting up jobs for production. Here is the
>>>> relevant configuration used on the job tracker and task tracker
>>>> machines.
>>>>
>>>> <property>
>>>>   <name>mapred.map.tasks</name>
>>>>   <value>7</value>
>>>>   <description>The default number of map tasks per job. Typically
>>>>   set to a prime several times greater than the number of
>>>>   available hosts. Ignored when mapred.job.tracker is "local".
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.reduce.parallel.copies</name>
>>>>   <value>20</value>
>>>>   <description>The default number of parallel transfers run by
>>>>   reduce during the copy (shuffle) phase.
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>   <value>5</value>
>>>>   <description>The maximum number of map tasks that will be run
>>>>   simultaneously by a task tracker.
>>>>   </description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>   <value>5</value>
>>>>   <description>The maximum number of reduce tasks that will be run
>>>>   simultaneously by a task tracker.
>>>>   </description>
>>>> </property>
>>>>
>>>> The query was: SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>>
>>>> Anyone know why this might only be running one map task at a time?
>>>> It takes about 5 minutes to go through 344 of them at this rate.
>>>>
>>>> Josh Ferguson
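To make Aaron's suggestion concrete, here is a minimal sketch of the
session he describes, reusing the query from Josh's original post. The
table and field names are the placeholders Josh used, and the value 5
assumes his mapred.tasktracker.reduce.tasks.maximum setting; adjust both
to your own cluster.

  hive> -- 5 assumes Josh's reduce slot count; match your own
  hive> set mapred.reduce.tasks = 5;
  hive> -- table.field is the placeholder from the original post
  hive> SELECT COUNT(DISTINCT(table.field)) FROM table;

One caveat is visible in the counters at the top of the thread: Reduce
input groups is 1, i.e. all 104,960 reduce input records shared a single
key. Rows that all funnel to one reduce key are processed by a single
reducer no matter how many reduce tasks are configured, so this setting
may not speed up this particular COUNT(DISTINCT) query.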
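Similarly, a sketch of Ashish's mapred.min.split.size suggestion, which
can also be set per session from the Hive CLI; the 64 MB figure below is
purely illustrative and does not come from the thread.

  hive> -- 67108864 bytes (64 MB) is an illustrative value only
  hive> set mapred.min.split.size = 67108864;

Raising the minimum split size makes large, splittable files carve into
fewer map tasks. Note, though, that the stock FileInputFormat still
produces at least one split per file, so this helps less when the input
is many tiny files, which is the situation Josh describes.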
