The answer is yes. I checked my code and found that I was generating a single
map key for each table, which I didn't need to do.

Now I'm generating keys that include the name of the table and a unique id.
That information is used by MultipleOutputFormat to generate the proper
output for each table.
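
For anyone hitting the same problem, here is a rough sketch of why the key change helps. It is not my actual job code: the class name, the tab-separated key layout and the "pageviews" table name are made up for illustration, and it uses plain Java strings instead of Hadoop's Text, so the real hash values will differ. The partition formula is the one the default HashPartitioner applies.

```java
import java.util.Arrays;

public class KeySkewSketch {
    // Same formula as Hadoop's default HashPartitioner: mask the sign bit,
    // then take the hash modulo the number of reduce tasks.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical key layout: table name, a tab, then a per-record id.
    static String compositeKey(String table, long recordId) {
        return table + "\t" + recordId;
    }

    // Recover the table name from a composite key, e.g. to pick the output file.
    static String tableOf(String compositeKey) {
        return compositeKey.substring(0, compositeKey.indexOf('\t'));
    }

    // Count how many records each reducer would receive for one table.
    static int[] recordsPerReducer(String table, int records, int numReducers, boolean composite) {
        int[] counts = new int[numReducers];
        for (long id = 0; id < records; id++) {
            String key = composite ? compositeKey(table, id) : table;
            counts[partitionFor(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Table-only key: every record hashes to the same reducer.
        System.out.println("table-only key: "
                + Arrays.toString(recordsPerReducer("pageviews", 60000, 6, false)));
        // Table name plus unique id: records spread over the reducers.
        System.out.println("composite key:  "
                + Arrays.toString(recordsPerReducer("pageviews", 60000, 6, true)));
    }
}
```

With the table-only key all records share one hash, so a single reducer gets everything, which matches the "Reduce input groups 1" counter quoted below. In a real job the table name would probably be recovered from the key inside MultipleOutputFormat's generateFileNameForKeyValue(), along the lines of tableOf() above.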

2011/10/23 Raimon Bosch <[email protected]>

> Thanks for your help,
>
> In fact, I'm using MultipleOutputFormat to generate one file for each Hive
> table, and in this case I'm generating only one of the possible Hive tables.
> Can I use MultipleOutputFormat and still distribute my keys across the whole
> cluster?
>
> 2011/10/23 Ayon Sinha <[email protected]>
>
>> It looks like that is the reducer that is actually doing the work, with 14M
>> input records.
>>
>>
>>  Reduce input groups 1
>>  Combine output records 0
>>  Reduce shuffle bytes 5,135,004,496
>>  Reduce output records 14,232,592
>>  Spilled Records 14,232,592
>>  Combine input records 0
>>  Reduce input records 14,232,592
>>
>>
>>
>> Other reducers have this:
>>  Reduce output records 0
>>  Spilled Records 0
>>  Combine input records 0
>>  Reduce input records 0
>>
>> -Ayon
>>
>>
>>
>> ________________________________
>> From: Raimon Bosch <[email protected]>
>> To: [email protected]
>> Sent: Saturday, October 22, 2011 6:01 PM
>> Subject: why one of the reducers it's always slower?
>>
>> Hi all,
>>
>> I'm running a job to convert logs into Hive tables. The times are very
>> good now that we have added a proper number of nodes, but the reduce phase
>> always takes longer on one of the machines.
>>
>> (JobTracker at http://204.236.208.103:50030/, job_201110211442_0086)
>>
>> Task                             Complete  Status           Start Time            Finish Time                          Counters
>> task_201110211442_0086_r_000000  100.00%   reduce > reduce  23-Oct-2011 00:26:42  23-Oct-2011 00:28:09 (1mins, 27sec)  9
>> task_201110211442_0086_r_000001  100.00%   reduce > reduce  23-Oct-2011 00:26:42  23-Oct-2011 00:28:10 (1mins, 27sec)  9
>> task_201110211442_0086_r_000002  100.00%   reduce > reduce  23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)  9
>> task_201110211442_0086_r_000003  100.00%   reduce > reduce  23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)  9
>> task_201110211442_0086_r_000004  100.00%   reduce > reduce  23-Oct-2011 00:26:44  23-Oct-2011 00:35:56 (9mins, 11sec)  10
>> task_201110211442_0086_r_000005  100.00%   reduce > reduce  23-Oct-2011 00:26:44  23-Oct-2011 00:28:09 (1mins, 24sec)  9
>>
>> As you can see in the statistics, of the 6 reduce tasks one takes 9
>> minutes while the rest take about 1 minute each. I think this is because
>> one of the reducers has to spend time sorting the results from the rest
>> of the nodes.
>>
>> Is there a way to reduce this time?
>>
>> Thanks in advance,
>> Raimon Bosch
>>
>
>
