Josh,

Given that you're running all of this on the same machine, the shuffle
process should be nearly instantaneous, since it's all local copies. This
strikes me as less a Hive issue than a Hadoop MapReduce one. You might want
to try emailing [email protected] to see if someone there can tell
you what settings might be misconfigured to lead to such poor local shuffle
performance.
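
In the meantime, here is a sketch of the shuffle-related knobs I would
double-check first. The values shown are just the stock defaults from
hadoop-default.xml, so treat them as illustrative rather than as
recommendations:

  <property>
    <name>io.sort.mb</name>
    <value>100</value>
    <description>Buffer size (in MB) used to sort map output before it is
    spilled to disk; too small a value forces extra spill and merge passes.
    </description>
  </property>

  <property>
    <name>io.sort.factor</name>
    <value>10</value>
    <description>Number of streams merged at once during sorting; raising it
    reduces the number of passes over the local spill files.
    </description>
  </property>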

Sorry this is such a snafu for you.

- Aaron

On Wed, Jan 28, 2009 at 4:53 PM, Josh Ferguson <[email protected]> wrote:

> I also just wanted to add, here are the counters for the one reduce task
>
> *File Systems*
>   S3 bytes written: 124
>   Local bytes read: 5,667,846
>   Local bytes written: 5,667,846
>
> *Map-Reduce Framework*
>   Reduce input groups: 1
>   Combine output records: 0
>   Reduce output records: 0
>   Combine input records: 0
>   Reduce input records: 104,960
>
>
> The task detail page shows the reduce at 100.00% ("reduce > reduce"),
> started 28-Jan-2009 01:40:44 and finished 28-Jan-2009 01:46:32
> (5mins, 47sec).
>
> Seems like a long time for such a small set of data (This set is even more
> pared down than the previous set I was using).
>
> Josh Ferguson
>
> On Wed, Jan 28, 2009 at 3:09 PM, Josh Ferguson <[email protected]> wrote:
>
>> So I get all of that. I think what is happening is that I'm mapping a lot
>> of files that are also small in total size (because I have them
>> partitioned), so it is automatically choosing to execute only one mapper
>> even though my machine has a total of 5 slots available. This is doubling
>> my total map time due to task setup overhead. There is probably nothing I
>> can do to get around this other than to try not to store so many small
>> files.
>> As far as the reduce speed goes, I'm not really sure what is causing it to
>> be so slow. I'm reducing (via Hive's count function) 16 MB of files, which
>> is about 144,000 records; it wouldn't take more than 5 seconds anywhere
>> else, yet it's taking my task tracker over 18 minutes. I just have a
>> feeling that I'm doing something terribly wrong to get performance like
>> that.
>> Does anyone else have benchmarks for doing a count over some rows on one
>> or two task trackers?
>>
>> Josh Ferguson
>>
>>
>> On Wed, Jan 28, 2009 at 11:31 AM, Aaron Kimball <[email protected]> wrote:
>>
>>> Hi Josh,
>>>
>>> "Jobs" and "Tasks" mean different things in Hadoop. A job is comprised of
>>> many tasks (in your case, 344 map tasks + some number of reduce tasks).
>>> There are effectively two levels of schedulers at work here. At the highest
>>> level, a single job is dequeued off the front of the FIFO list of jobs. Its
>>> tasks are then eligible for execution. TaskTrackers will then dequeue
>>> individual map tasks out of this job, and execute as many tasks in parallel
>>> as they're configured to. So your TaskTracker may actually be running five
>>> tasks in parallel -- five map splits -- but as Ashish points out, that may
>>> look like it's only "retiring" one task at a time, even though it's got four
>>> others in flight simultaneously.
>>>
>>> But if you enqueue a second job, it gets in line behind all the tasks
>>> associated with the first job. So on the top-level status page, that 2nd job
>>> will list as "pending" because the map slots are all occupied by the first
>>> job (which is actually most likely running multiple tasks in parallel). Only
>>> after all the mappers for the first job are done, will mappers for the
>>> second job be considered. If you click on job details for the first
>>> (running) job, and then click on "Mappers," it will show you a status line
>>> for each mapper running. This can confirm whether you're actually running 1
>>> or 5 mappers concurrently.
>>>
>>> There are other top-level job schedulers (FairScheduler,
>>> CapacityScheduler) which reserve capacity for multiple jobs to execute at
>>> the same time, splitting the available task slots between them.
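>>>
>>> If you ever want to experiment with one of those, the fair scheduler
>>> ships as a contrib jar; a minimal sketch of turning it on (assuming the
>>> jar is already on the JobTracker's classpath) is a single property:
>>>
>>>   <property>
>>>     <name>mapred.jobtracker.taskScheduler</name>
>>>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>>>   </property>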
>>>
>>> Also, in Hive, try:
>>>
>>> hive> set mapred.reduce.tasks = 5;
>>>
>>> before you run your SELECT query. While the number of map tasks is
>>> auto-determined based on the size of the input, the number of reduce tasks
>>> is user-controlled. You may see better reduce performance and more
>>> parallelism if you manually set that figure to equal the number of reduce
>>> task slots available.
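>>>
>>> Putting that together with the query from your first mail, the session
>>> would look something like this:
>>>
>>> hive> set mapred.reduce.tasks = 5;
>>> hive> SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>
>>> (Whether this particular query can actually use all five reducers depends
>>> on how Hive plans the aggregation.)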
>>>
>>> - Aaron
>>>
>>>
>>> On Wed, Jan 28, 2009 at 9:52 AM, Josh Ferguson <[email protected]> wrote:
>>>
>>>> I only have one task tracker right now because I'm just setting up some
>>>> testing. But that one machine only runs 1 mapper at a time. In the job
>>>> tracker web interface I only ever see 1 job running at a time and no jobs
>>>> ever start simultaneously from what I can tell. Is the behavior of a single
>>>> task tracker that it can spawn *only* 1 child JVM at a time to do maps for 
>>>> a
>>>> single job? How do I get it to spawn 4-6 children for mapping jobs at once?
>>>> Josh Ferguson.
>>>>
>>>>
>>>> On Wed, Jan 28, 2009 at 7:38 AM, Ashish Thusoo <[email protected]> wrote:
>>>>
>>>>> How many nodes do you have in your map/reduce cluster? It could just be
>>>>> the case that the cluster does not have enough map slots, so all 344
>>>>> maps cannot be run simultaneously. Suppose you had a 4-node cluster;
>>>>> then by your configuration you would have a total of 20 map slots. So
>>>>> you would see 20 mappers started off, and then, as each mapper finishes,
>>>>> another would move from pending to started. This could give the illusion
>>>>> that mappers are running one at a time, though at any time 20 are
>>>>> running concurrently.
>>>>>
>>>>> Also you could potentially decrease the number of mappers being run by
>>>>> setting mapred.min.split.size.
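>>>>>
>>>>> For example, something like this in your site config raises the lower
>>>>> bound on split sizes (the 64 MB value is purely illustrative; tune it
>>>>> to your file sizes):
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.min.split.size</name>
>>>>>     <value>67108864</value>
>>>>>     <description>Minimum size (in bytes) of an input split; 64 MB here
>>>>>     as an illustrative value to coalesce small inputs into fewer maps.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>> One caveat: with the stock FileInputFormat a split never spans file
>>>>> boundaries, so this helps most when individual files are larger than
>>>>> the block size.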
>>>>>
>>>>> Ashish
>>>>>
>>>>> ________________________________________
>>>>> From: Josh Ferguson [[email protected]]
>>>>> Sent: Tuesday, January 27, 2009 9:20 PM
>>>>> To: [email protected]
>>>>> Subject: Number of tasks
>>>>>
>>>>> Ok, so I'm experimenting with the slow-running Hive query I was having
>>>>> trouble with earlier. It was indeed only processing one map task at a
>>>>> time, even though I *think* I told it to do more. Anyone who is good
>>>>> with Hadoop, feel free to speak up here as well; this is my first foray
>>>>> into trying to set up jobs for production. Here is the relevant
>>>>> configuration used on the job tracker and task tracker machines.
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.map.tasks</name>
>>>>>     <value>7</value>
>>>>>     <description>The default number of map tasks per job. Typically set
>>>>>     to a prime several times greater than number of available hosts.
>>>>>     Ignored when mapred.job.tracker is "local".
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.reduce.parallel.copies</name>
>>>>>     <value>20</value>
>>>>>     <description>The default number of parallel transfers run by reduce
>>>>>     during the copy(shuffle) phase.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>     <value>5</value>
>>>>>     <description>The maximum number of map tasks that will be run
>>>>>     simultaneously by a task tracker.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>     <value>5</value>
>>>>>     <description>The maximum number of reduce tasks that will be run
>>>>>     simultaneously by a task tracker.
>>>>>     </description>
>>>>>   </property>
>>>>>
>>>>> The query was SELECT COUNT(DISTINCT(table.field)) FROM table;
>>>>>
>>>>> Anyone know why this might only be running one map task at a time?
>>>>> Takes about 5 minutes to go through 344 of them at this rate.
>>>>>
>>>>> Josh Ferguson
>>>>>
>>>>
>>>>
>>>
>>
>
