So I get all of that. I think what is happening is that I'm mapping a lot of
files that are also small in total size (because I have them partitioned), so
it is automatically choosing to execute only one mapper even though my
machine has a total of 5 slots available. This is causing the problem that
task setup overhead is doubling my total map time. There is probably nothing
I can do to get around this other than to try not to store so many small
files.
As for the reduce speed, I'm not really sure what is causing it to be so
slow. I'm reducing (via Hive's count function) 16 MB of files, which is about
144,000 records. It wouldn't take more than 5 seconds to do anywhere else,
yet it's taking my task tracker over 18 minutes. I just have a feeling that
I'm doing something terribly wrong to get performance like that.
Does anyone else have benchmarks for running a count over some rows on one
or two task trackers?

Josh Ferguson

On Wed, Jan 28, 2009 at 11:31 AM, Aaron Kimball <[email protected]> wrote:

> Hi Josh,
>
> "Jobs" and "Tasks" mean different things in Hadoop. A job is comprised of
> many tasks (in your case, 344 map tasks + some number of reduce tasks).
> There are effectively two levels of schedulers at work here. At the highest
> level, a single job is dequeued off the front of the FIFO list of jobs. Its
> tasks are then eligible for execution. TaskTrackers will then dequeue
> individual map tasks out of this job, and execute as many tasks in parallel
> as they're configured to. So your TaskTracker may actually be running five
> tasks in parallel -- five map splits -- but as Ashish points out, that may
> look like it's only "retiring" one task at a time, even though it's got four
> others in flight simultaneously.
>
> But if you enqueue a second job, it gets in line behind all the tasks
> associated with the first job. So on the top-level status page, that 2nd job
> will list as "pending" because the map slots are all occupied by the first
> job (which is actually most likely running multiple tasks in parallel). Only
> after all the mappers for the first job are done will mappers for the
> second job be considered. If you click on job details for the first
> (running) job, and then click on "Mappers," it will show you a status line
> for each mapper running. This can confirm whether you're actually running 1
> or 5 mappers concurrently.
>
> There are other top-level job schedulers (FairScheduler, CapacityScheduler)
> which reserve capacity for multiple jobs to execute at the same time,
> splitting the available task slots between them.
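>
> For what it's worth, swapping in one of those schedulers is just a
> JobTracker config change. A minimal sketch, assuming the contrib
> fairscheduler jar is on the JobTracker's classpath (this would go in
> hadoop-site.xml on the JobTracker machine):
>
>   <property>
>     <name>mapred.jobtracker.taskScheduler</name>
>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>   </property>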
>
> Also, in Hive, try:
>
> hive> set mapred.reduce.tasks = 5;
>
> before you run your SELECT query. While the number of map tasks is
> auto-determined based on the size of the input, the number of reduce tasks
> is user-controlled. You may see better reduce performance and more
> parallelism if you manually set that figure to equal the number of reduce
> task slots available.
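>
> For example, using the query from your earlier mail, a session might look
> like this (table and field names are yours, of course):
>
> hive> set mapred.reduce.tasks = 5;
> hive> SELECT COUNT(DISTINCT(table.field)) FROM table;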
>
> - Aaron
>
>
> On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
>
>> I only have one task tracker right now because I'm just setting up some
>> testing. But that one machine only runs 1 mapper at a time. In the job
>> tracker web interface I only ever see 1 job running at a time and no jobs
>> ever start simultaneously from what I can tell. Is the behavior of a single
>> task tracker that it can spawn *only* 1 child JVM at a time to do maps for a
>> single job? How do I get it to spawn 4-6 children for mapping jobs at once?
>> Josh Ferguson.
>>
>>
>> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>>
>>> How many nodes do you have in your map/reduce cluster? It could just be
>>> the case that the cluster does not have enough map slots, so all 344 maps
>>> cannot be run simultaneously. Suppose you had a 4 node cluster. Then by
>>> your configuration you would have a total of 20 map slots. So you would
>>> see 20 mappers started off, and then, as each mapper finishes, another
>>> would move from pending to started. This could give the illusion that
>>> mappers are running one at a time, though at any time 20 are running
>>> concurrently.
>>>
>>> Also, you could potentially decrease the number of mappers being run by
>>> setting mapred.min.split.size to a larger value.
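>>>
>>> A sketch of what that might look like in hadoop-site.xml (the 64 MB
>>> value is just illustrative; tune it to your data). Note that splits
>>> never span files, so many tiny files will still get one mapper each:
>>>
>>>   <property>
>>>     <name>mapred.min.split.size</name>
>>>     <value>67108864</value>
>>>     <description>Each map split will cover at least this many bytes,
>>>     so larger values mean fewer, bigger map tasks.
>>>     </description>
>>>   </property>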
>>>
>>> Ashish
>>>
>>> ________________________________________
>>> From: Josh Ferguson [[email protected]]
>>> Sent: Tuesday, January 27, 2009 9:20 PM
>>> To: [email protected]
>>> Subject: Number of tasks
>>>
>>> Ok so I'm experimenting with the slow-running Hive query I was asking
>>> about earlier. It was indeed only processing one map task at a time even
>>> though I *think* I told it to do more. Anyone who is good with Hadoop,
>>> feel free to speak up here as well; this is my first foray into trying
>>> to set up jobs for production. Here is the relevant configuration used
>>> on the job tracker and task tracker machines.
>>>
>>>   <property>
>>>     <name>mapred.map.tasks</name>
>>>     <value>7</value>
>>>     <description>The default number of map tasks per job.  Typically
>>>     set to a prime several times greater than number of available hosts.
>>>     Ignored when mapred.job.tracker is "local".
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.reduce.parallel.copies</name>
>>>     <value>20</value>
>>>     <description>The default number of parallel transfers run by reduce
>>>     during the copy(shuffle) phase.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>     <value>5</value>
>>>     <description>The maximum number of map tasks that will be run
>>>     simultaneously by a task tracker.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>     <value>5</value>
>>>     <description>The maximum number of reduce tasks that will be run
>>>     simultaneously by a task tracker.
>>>     </description>
>>>   </property>
>>>
>>> The query was SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>
>>> Anyone know why this might only be running one map task at a time?
>>> It takes about 5 minutes to go through 344 of them at this rate.
>>>
>>> Josh Ferguson
>>>
>>
>>
>
