Josh,

Given that you're running all of this on the same machine, the shuffle phase
should be nearly instantaneous, since all the copies are local. This strikes
me as less a Hive issue than a Hadoop MapReduce one. You might want to try
emailing [email protected] and see if someone there can tell you which
settings might be misconfigured to lead to such poor local shuffle
performance.
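
If it helps in the meantime, the two settings I'd sanity-check first are the
sort buffer ones below. I believe these are the stock defaults in your
version, but treat the values as illustrative rather than as a
recommendation:

<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory, in megabytes, to use
  while sorting map output.</description>
</property>

<property>
  <name>io.sort.factor</name>
  <value>10</value>
  <description>The number of streams to merge at once while sorting
  files.</description>
</property>
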
Sorry this is such a snafu for you.

- Aaron

On Wed, Jan 28, 2009 at 4:53 PM, Josh Ferguson <[email protected]> wrote:
> I also just wanted to add, here are the counters for the one reduce task:
>
> File Systems
>   S3 bytes written:     124
>   Local bytes read:     5,667,846
>   Local bytes written:  5,667,846
>
> Map-Reduce Framework
>   Reduce input groups:    1
>   Combine output records: 0
>   Reduce output records:  0
>   Combine input records:  0
>   Reduce input records:   104,960
>
> The task shows 100.00% reduce, started 28-Jan-2009 01:40:44, finished
> 28-Jan-2009 01:46:32 (5 mins, 47 sec).
>
> Seems like a long time for such a small set of data (this set is even more
> pared down than the previous set I was using).
>
> Josh Ferguson
>
> On Wed, Jan 28, 2009 at 3:09 PM, Josh Ferguson <[email protected]> wrote:
>> So I get all of that. I think what is happening is that I'm mapping a lot
>> of files that are also small in total size (because I have them
>> partitioned), so it is automatically choosing to execute only one mapper
>> even though my machine has a total of 5 available. This is doubling my
>> total map time due to task setup overhead. There is probably nothing I
>> can do to get around this other than to try not to store so many small
>> files.
>>
>> As far as the reduce speed goes, I'm not really sure what is making it so
>> slow. I'm reducing (via Hive's count function) 16 MB of files, which is
>> about 144,000 records; it wouldn't take more than 5 seconds anywhere
>> else, and it's taking over 18 minutes for my task tracker to do it. I
>> just have a feeling that I'm doing something terribly wrong to get
>> performance like that.
>>
>> Does anyone else have benchmarks for doing a count on some rows with one
>> or two task trackers?
>>
>> Josh Ferguson
>>
>> On Wed, Jan 28, 2009 at 11:31 AM, Aaron Kimball <[email protected]> wrote:
>>> Hi Josh,
>>>
>>> "Jobs" and "tasks" mean different things in Hadoop. A job comprises many
>>> tasks (in your case, 344 map tasks plus some number of reduce tasks).
>>> There are effectively two levels of schedulers at work here. At the
>>> highest level, a single job is dequeued off the front of the FIFO list
>>> of jobs, and its tasks become eligible for execution. TaskTrackers will
>>> then dequeue individual map tasks out of this job and execute as many
>>> tasks in parallel as they're configured to. So your TaskTracker may
>>> actually be running five tasks in parallel -- five map splits -- but as
>>> Ashish points out, it may look like it's only "retiring" one task at a
>>> time, even though it has four others in flight simultaneously.
>>>
>>> But if you enqueue a second job, it gets in line behind all the tasks
>>> associated with the first job. So on the top-level status page, that
>>> second job will be listed as "pending" because the map slots are all
>>> occupied by the first job (which is most likely running multiple tasks
>>> in parallel). Only after all the mappers for the first job are done will
>>> mappers for the second job be considered. If you click on the job
>>> details for the first (running) job, and then click on "Mappers," it
>>> will show you a status line for each running mapper. This can confirm
>>> whether you're actually running 1 or 5 mappers concurrently.
>>>
>>> There are other top-level job schedulers (FairScheduler,
>>> CapacityScheduler) which reserve capacity for multiple jobs to execute
>>> at the same time, splitting the available task slots between them.
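>>>
>>> If you ever want to experiment with one of those, the switch is a
>>> single property on the JobTracker (this is just a sketch: it assumes
>>> the fair scheduler contrib jar is on the JobTracker's classpath, and I
>>> haven't tried it on your version):
>>>
>>> <property>
>>>   <name>mapred.jobtracker.taskScheduler</name>
>>>   <value>org.apache.hadoop.mapred.FairScheduler</value>
>>>   <description>Swaps out the default FIFO scheduler so that task slots
>>>   are shared among concurrently running jobs.</description>
>>> </property>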
>>>
>>> Also, in Hive, try:
>>>
>>> hive> set mapred.reduce.tasks = 5;
>>>
>>> before you run your SELECT query. While the number of map tasks is
>>> auto-determined based on the size of the input, the number of reduce
>>> tasks is user-controlled. You may see better reduce performance and more
>>> parallelism if you manually set that figure to equal the number of
>>> reduce task slots available.
>>>
>>> - Aaron
>>>
>>> On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
>>>> I only have one task tracker right now because I'm just setting up some
>>>> testing, but that one machine only runs 1 mapper at a time. In the job
>>>> tracker web interface I only ever see 1 job running at a time, and no
>>>> jobs ever start simultaneously from what I can tell. Is it the behavior
>>>> of a single task tracker that it can spawn *only* 1 child JVM at a time
>>>> to do maps for a single job? How do I get it to spawn 4-6 children for
>>>> mapping jobs at once?
>>>>
>>>> Josh Ferguson
>>>>
>>>> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>>>>> How many nodes do you have in your map/reduce cluster? It could just
>>>>> be the case that the cluster does not have enough map slots, so all
>>>>> 344 maps cannot run simultaneously. Suppose you had a 4-node cluster;
>>>>> then by your configuration you would have a total of 20 map slots. You
>>>>> would see 20 mappers started off, and then, as each mapper finishes,
>>>>> another would move from pending to started. This could give the
>>>>> illusion that mappers are running one at a time, though at any time 20
>>>>> are running concurrently.
>>>>>
>>>>> Also, you could potentially decrease the number of mappers being run
>>>>> by setting mapred.min.split.size.
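>>>>>
>>>>> For example, something like this (the 64 MB value is only an
>>>>> illustration; you would tune it until you get the number of mappers
>>>>> you want):
>>>>>
>>>>> <property>
>>>>>   <name>mapred.min.split.size</name>
>>>>>   <value>67108864</value>
>>>>>   <description>Lower bound on the size of an input split, in bytes
>>>>>   (64 MB here). Larger splits mean fewer map tasks.</description>
>>>>> </property>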
>>>>>
>>>>> Ashish
>>>>>
>>>>> ________________________________________
>>>>> From: Josh Ferguson [[email protected]]
>>>>> Sent: Tuesday, January 27, 2009 9:20 PM
>>>>> To: [email protected]
>>>>> Subject: Number of tasks
>>>>>
>>>>> Ok, so I'm experimenting with the slow-running Hive query I was having
>>>>> trouble with earlier. It was indeed only processing one map task at a
>>>>> time, even though I *think* I told it to do more. Anyone who is good
>>>>> with Hadoop, feel free to speak up here as well; this is my first
>>>>> foray into trying to set up jobs for production. Here is the relevant
>>>>> configuration used on the job tracker and task tracker machines.
>>>>>
>>>>> <property>
>>>>>   <name>mapred.map.tasks</name>
>>>>>   <value>7</value>
>>>>>   <description>The default number of map tasks per job. Typically set
>>>>>   to a prime several times greater than the number of available
>>>>>   hosts. Ignored when mapred.job.tracker is "local".
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>   <name>mapred.reduce.parallel.copies</name>
>>>>>   <value>20</value>
>>>>>   <description>The default number of parallel transfers run by reduce
>>>>>   during the copy (shuffle) phase.
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>   <value>5</value>
>>>>>   <description>The maximum number of map tasks that will be run
>>>>>   simultaneously by a task tracker.
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>   <value>5</value>
>>>>>   <description>The maximum number of reduce tasks that will be run
>>>>>   simultaneously by a task tracker.
>>>>>   </description>
>>>>> </property>
>>>>>
>>>>> The query was: SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>>>
>>>>> Anyone know why this might only be running one map task at a time? It
>>>>> takes about 5 minutes to go through 344 of them at this rate.
>>>>>
>>>>> Josh Ferguson
