I also just wanted to add, here are the counters for the one reduce task:

*File Systems*
  S3 bytes written         124
  Local bytes read         5,667,846
  Local bytes written      5,667,846

*Map-Reduce Framework*
  Reduce input groups      1
  Combine output records   0
  Reduce output records    0
  Combine input records    0
  Reduce input records     104,960



Status:   100.00% (reduce > reduce)
Started:  28-Jan-2009 01:40:44
Finished: 28-Jan-2009 01:46:32 (5mins, 47sec)
Counters: 8

Seems like a long time for such a small set of data (this set is even more
pared down than the previous one I was using).

Josh Ferguson

On Wed, Jan 28, 2009 at 3:09 PM, Josh Ferguson <[email protected]> wrote:

> So I get all of that. I think what is happening is that I'm mapping a lot
> of files that are small in total size (because I have them partitioned), so
> Hadoop is automatically choosing to execute only one mapper even though my
> machine has a total of 5 slots available. This is causing a problem: task
> setup overhead is doubling my total map time. There is probably nothing I
> can do to get around this other than to try not to store so many small
> files.
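>
> One thing I might experiment with, since the cost is task setup rather
> than the work itself, is task JVM reuse. I believe recent Hadoop versions
> support it; a sketch of what I'd try in hadoop-site.xml (assuming the
> property exists in my version):
>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>-1</value>
>     <description>Reuse each child JVM for an unlimited number of tasks
>     from the same job, rather than paying JVM startup cost per task.
>     </description>
>   </property>
>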
> As far as the reduce speed, I'm not really sure what is causing it to go
> so slow. I'm reducing (via Hive's count function) 16 MB of files, which is
> about 144,000 records. It wouldn't take more than 5 seconds to do anywhere
> else, and it's taking over 18 minutes for my task tracker to do it. I just
> have a feeling that I'm doing something terribly wrong to get performance
> like that.
> Does anyone else have benchmarks on one or two task trackers to do a count
> on some rows?
>
> Josh Ferguson
>
>
> On Wed, Jan 28, 2009 at 11:31 AM, Aaron Kimball <[email protected]> wrote:
>
>> Hi Josh,
>>
>> "Jobs" and "Tasks" mean different things in Hadoop. A job is comprised of
>> many tasks (in your case, 344 map tasks + some number of reduce tasks).
>> There are effectively two levels of schedulers at work here. At the highest
>> level, a single job is dequeued off the front of the FIFO list of jobs. Its
>> tasks are then eligible for execution. TaskTrackers will then dequeue
>> individual map tasks out of this job, and execute as many tasks in parallel
>> as they're configured to. So your TaskTracker may actually be running five
>> tasks in parallel -- five map splits -- but as Ashish points out, that may
>> look like it's only "retiring" one task at a time, even though it's got four
>> others in flight simultaneously.
>>
>> But if you enqueue a second job, it gets in line behind all the tasks
>> associated with the first job. So on the top-level status page, that 2nd job
>> will list as "pending" because the map slots are all occupied by the first
>> job (which is actually most likely running multiple tasks in parallel). Only
>> after all the mappers for the first job are done will mappers for the
>> second job be considered. If you click on job details for the first
>> (running) job, and then click on "Mappers," it will show you a status line
>> for each mapper running. This can confirm whether you're actually running 1
>> or 5 mappers concurrently.
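>>
>> If you prefer to watch from the command line, these should show the same
>> job progress (the job id here is made up for illustration):
>>
>>   bin/hadoop job -list
>>   bin/hadoop job -status job_200901280140_0001
>>
>> Note that -status reports overall map/reduce completion rather than a
>> per-task listing, so the web UI is still the easiest way to count
>> concurrently running mappers.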
>>
>> There are other top-level job schedulers (FairScheduler,
>> CapacityScheduler) which reserve capacity for multiple jobs to execute at
>> the same time, splitting the available task slots between them.
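>>
>> Enabling one of these is roughly the following in hadoop-site.xml (a
>> sketch -- the scheduler is a contrib module, so the exact class name,
>> classpath setup, and extra properties depend on your Hadoop version):
>>
>>   <property>
>>     <name>mapred.jobtracker.taskScheduler</name>
>>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>>   </property>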
>>
>> Also, in Hive, try:
>>
>> hive> set mapred.reduce.tasks=5;
>>
>> before you run your SELECT query. While the number of map tasks is
>> auto-determined based on the size of the input, the number of reduce tasks
>> is user-controlled. You may see better reduce performance and more
>> parallelism if you manually set that figure to equal the number of reduce
>> task slots available.
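>>
>> If you'd rather make that a default than a per-session setting, the same
>> property can go in your hadoop-site.xml next to the settings you already
>> posted:
>>
>>   <property>
>>     <name>mapred.reduce.tasks</name>
>>     <value>5</value>
>>     <description>The default number of reduce tasks per job.
>>     </description>
>>   </property>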
>>
>> - Aaron
>>
>>
>> On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
>>
>>> I only have one task tracker right now because I'm just setting up some
>>> testing, but that one machine only runs 1 mapper at a time. In the job
>>> tracker web interface I only ever see 1 job running at a time, and no jobs
>>> ever start simultaneously from what I can tell. Is it the behavior of a
>>> single task tracker that it can spawn *only* 1 child JVM at a time to do
>>> maps for a single job? How do I get it to spawn 4-6 children for mapping
>>> jobs at once?
>>>
>>> Josh Ferguson
>>>
>>>
>>> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>>>
>>>> How many nodes do you have in your map/reduce cluster? It could just be
>>>> the case that the cluster does not have enough map slots, so all 344 maps
>>>> cannot be run simultaneously. Suppose you had a 4 node cluster. Then by
>>>> your configuration you would have a total of 20 map slots. So you would
>>>> see 20 mappers started off, and then, as each mapper finishes, another
>>>> would move from pending to started. This could give the illusion that
>>>> mappers are running one at a time, though at any time 20 are running
>>>> concurrently.
>>>>
>>>> Also you could potentially decrease the number of mappers being run by
>>>> setting mapred.min.split.size.
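>>>>
>>>> For example, this in hadoop-site.xml would keep splits from going below
>>>> 64 MB (a sketch; the value is in bytes, and since splits don't cross
>>>> file boundaries it helps most when the individual input files are large):
>>>>
>>>>   <property>
>>>>     <name>mapred.min.split.size</name>
>>>>     <value>67108864</value>
>>>>     <description>Lower bound on the size of each input split, in bytes.
>>>>     </description>
>>>>   </property>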
>>>>
>>>> Ashish
>>>>
>>>> ________________________________________
>>>> From: Josh Ferguson [[email protected]]
>>>> Sent: Tuesday, January 27, 2009 9:20 PM
>>>> To: [email protected]
>>>> Subject: Number of tasks
>>>>
>>>> Ok, so I'm experimenting with the slow-running Hive query I was asking
>>>> about earlier. It was indeed only processing one map task at a time, even
>>>> though I *think* I told it to do more. Anyone who is good with Hadoop,
>>>> feel free to speak up here as well; this is my first foray into trying
>>>> to set up jobs for production. Here is the relevant configuration used
>>>> on the job tracker and task tracker machines.
>>>>
>>>>   <property>
>>>>     <name>mapred.map.tasks</name>
>>>>     <value>7</value>
>>>>     <description>The default number of map tasks per job.  Typically set
>>>>     to a prime several times greater than number of available hosts.
>>>>     Ignored when mapred.job.tracker is "local".
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>mapred.reduce.parallel.copies</name>
>>>>     <value>20</value>
>>>>     <description>The default number of parallel transfers run by reduce
>>>>     during the copy(shuffle) phase.
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>     <value>5</value>
>>>>     <description>The maximum number of map tasks that will be run
>>>>     simultaneously by a task tracker.
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>     <value>5</value>
>>>>     <description>The maximum number of reduce tasks that will be run
>>>>     simultaneously by a task tracker.
>>>>     </description>
>>>>   </property>
>>>>
>>>> The query was SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>>
>>>> Anyone know why this might only be running one map task at a time?
>>>> Takes about 5 minutes to go through 344 of them at this rate.
>>>>
>>>> Josh Ferguson
>>>>
>>>
>>>
>>
>
