RE: Self join problem

Namit Jain Tue, 03 Nov 2009 18:54:14 -0800

The number of reducers are inferred from the input data size. But, you can 
always overwrite it by setting mapred.reduce.tasks

From: Defenestrator [mailto:[email protected]]
Sent: Tuesday, November 03, 2009 6:46 PM
To: [email protected]
Subject: Re: Self join problem

Hi Namit,

Thanks for your suggestion.

I tried changing the query as you had suggested by moving the m1.dt = m2.dt to 
the on clause.  It increased the number of reduce jobs to 2.  So now there are 
two processes running on two nodes at 100% consuming a lot of memory.  Is there 
a reason why hive doesn't spawn more reduce jobs for this query?

On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:
Get the join condition in the on condition:

insert overwrite table foo1
select m1.id<http://m1.id> as id_1, m2.id<http://m2.id> as id_2, count(1), m1.dt
from m1 join m2 on m1.dt=m2.dt where m1.id<http://m1.id> <> m2.id<http://m2.id> 
and m1.id<http://m1.id> < m2.id<http://m2.id> group by m1.id<http://m1.id>, 
m2.id<http://m2.id>, m1.dt;

From: Defenestrator 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Tuesday, November 03, 2009 4:44 PM
To: [email protected]<mailto:[email protected]>
Subject: Self join problem

Hello,

I'm trying to run the following query where m1 and m2 have the same data (>29M 
rows) on a 3-node hadoop cluster.  I'm essentially trying to do a self join.  
It ends up running 269 map jobs and 1 reduce job.  The map jobs complete but 
the reduce job just runs on one process on one of the hadoop nodes at 100% cpu 
utilization and just slowly increases in memory consumption.  The reduce job 
never goes beyond 82% complete despite letting it run for a day.

I am running on 0.5.0 based on this morning's trunk.

insert overwrite table foo1
select m1.id<http://m1.id> as id_1, m2.id<http://m2.id> as id_2, count(1), m1.dt
from m1 join m2 where m1.id<http://m1.id> <> m2.id<http://m2.id> and 
m1.id<http://m1.id> < m2.id<http://m2.id> and m1.dt = m2.dt group by 
m1.id<http://m1.id>, m2.id<http://m2.id>, m1.dt;

Any input would be appreciated.

RE: Self join problem

Reply via email to