The answer is yes. I checked my code and found that I was generating a single map key per table, which wasn't necessary and sent all the records for a table to the same reducer.
Now I'm generating keys that include the name of the table and a unique id. That information is used in the MultipleOutputFormat step to generate the proper output for each table.

2011/10/23 Raimon Bosch <[email protected]>

> Thanks for your help,
>
> In fact, I'm using MultipleOutputFormat to generate one file for each hive
> table and in this case I'm generating only one of the possible hive tables.
> Can I use MultipleOutputFormat and still distribute my keys over all the
> cluster?
>
> 2011/10/23 Ayon Sinha <[email protected]>
>
>> Looks like that is the reducer who is actually doing the work with 14M
>> input records.
>>
>> Reduce input groups      1
>> Combine output records   0
>> Reduce shuffle bytes     5,135,004,496
>> Reduce output records    14,232,592
>> Spilled Records          14,232,592
>> Combine input records    0
>> Reduce input records     14,232,592
>>
>> Other reducers have this:
>> Reduce output records    0
>> Spilled Records          0
>> Combine input records    0
>> Reduce input records     0
>>
>> -Ayon
>> See My Photos on Flickr
>> Also check out my Blog for answers to commonly asked questions.
>>
>> ________________________________
>> From: Raimon Bosch <[email protected]>
>> To: [email protected]
>> Sent: Saturday, October 22, 2011 6:01 PM
>> Subject: why one of the reducers it's always slower?
>>
>> Hi all,
>>
>> I'm executing one job to convert logs into hive tables. The times are very
>> good once we have added a proper number of nodes but the reduce phase
>> spends always more time in one of the machines.
>>
>> task_201110211442_0086_r_000000
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000000>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:42   23-Oct-2011 00:28:09 (1mins, 27sec)
>> 9 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000000>
>>
>> task_201110211442_0086_r_000001
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000001>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:42   23-Oct-2011 00:28:10 (1mins, 27sec)
>> 9 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000001>
>>
>> task_201110211442_0086_r_000002
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000002>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:43   23-Oct-2011 00:28:10 (1mins, 27sec)
>> 9 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000002>
>>
>> task_201110211442_0086_r_000003
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000003>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:43   23-Oct-2011 00:28:10 (1mins, 27sec)
>> 9 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000003>
>>
>> task_201110211442_0086_r_000004
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000004>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:44   23-Oct-2011 00:35:56 (9mins, 11sec)
>> 10 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000004>
>>
>> task_201110211442_0086_r_000005
>> <http://204.236.208.103:50030/taskdetails.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000005>
>> 100.00%   reduce > reduce
>> 23-Oct-2011 00:26:44   23-Oct-2011 00:28:09 (1mins, 24sec)
>> 9 <http://204.236.208.103:50030/taskstats.jsp?jobid=job_201110211442_0086&tipid=task_201110211442_0086_r_000005>
>>
>> As you can see in the statistics from 6 reduce executions one is spending
>> 9 minutes while the rest is spending 1 minute. I think that it is because
>> one of the reducers has to spend time sorting the results from the rest of
>> nodes.
>>
>> There is a way to reduce this time?
>>
>> Thanks in advance,
>> Raimon Bosch
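P.S. For anyone hitting the same skew, here is a rough, standalone sketch of the key change described at the top of this message. It is not the actual job code and it does not use the Hadoop classes: it only simulates how records spread over 6 reducers when the key is the table name alone versus the table name plus a per-record id. The class name KeySkewSketch, the tab separator, the table name "weblogs" and the record count are all made up for the illustration. The partition formula is the one Hadoop's default HashPartitioner applies, used here on String.hashCode() rather than on a real Text key.

```java
import java.util.Arrays;

/** Simulates reducer load for "table" keys versus "table + id" keys. */
final class KeySkewSketch {

    // Same formula Hadoop's default HashPartitioner applies to a key's hash code.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical key layout: table name, a tab, then a per-record id.
    static String compositeKey(String table, long id) {
        return table + "\t" + id;
    }

    // Recovers the table name from a composite key, e.g. to pick the output file.
    static String tableOf(String compositeKey) {
        return compositeKey.substring(0, compositeKey.indexOf('\t'));
    }

    // Counts how many records each reducer would receive.
    static long[] load(String table, int records, int numReducers, boolean composite) {
        long[] counts = new long[numReducers];
        for (long id = 0; id < records; id++) {
            String key = composite ? compositeKey(table, id) : table;
            counts[partition(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With a table-only key, every record hashes to the same reducer.
        System.out.println("table-only key: "
                + Arrays.toString(load("weblogs", 60000, 6, false)));
        // With a table + id key, records are spread over the reducers.
        System.out.println("table + id key: "
                + Arrays.toString(load("weblogs", 60000, 6, true)));
    }
}
```

In a real job, something like tableOf would presumably be called from the output format (for example from MultipleOutputFormat's generateFileNameForKeyValue) so that records from every reducer still end up under the right table name. Note that each reducer then writes its own part file per table.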
