Any thoughts on this? I've only had luck by reducing the data on each
side of the join. Is this something Hive's query-plan optimizer might
be able to improve in a future release?
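For what it's worth, the only workaround that has helped me is pre-filtering each side into smaller staging tables before the join. A rough sketch (the staging table name and the dt filter value are hypothetical, just to illustrate the idea):

```sql
-- Hypothetical staging table: keep only the partition/date range you
-- actually need, so far less data reaches the join's reduce phase.
insert overwrite table m1_small
select id, dt
from m1
where dt = '2009-11-03';

-- Then join the reduced tables (m1_small against itself or a similarly
-- reduced copy) instead of the full >29M-row tables.
```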
Thanks,
Ryan
On Nov 3, 2009, at 10:55 PM, Ryan LeCompte <[email protected]> wrote:
I've had a similar issue with a small cluster. Is there any way that
you can reduce the size of the data being joined on both sides? If
you search the forums for join issue, you will see the thread for my
issue and get some tips.
Thanks,
Ryan
On Nov 3, 2009, at 10:45 PM, Defenestrator <[email protected]
> wrote:
I was able to increase the number of reduce tasks manually to 32.
However, 28 of them finish, while the other 4 show the same behavior:
100% CPU and steadily growing memory consumption. I suspect it might
be an issue with the reduce tasks themselves - is there a way to
figure out exactly what these tasks are doing?
Thanks.
On Tue, Nov 3, 2009 at 6:53 PM, Namit Jain <[email protected]>
wrote:
The number of reducers is inferred from the input data size, but you
can always override it by setting mapred.reduce.tasks.
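For example, before running the query (32 is just an arbitrary illustration value; tune it to your cluster):

```sql
-- Override the inferred reducer count for subsequent queries
-- in this session (example value; not a recommendation).
set mapred.reduce.tasks=32;
```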
From: Defenestrator [mailto:[email protected]]
Sent: Tuesday, November 03, 2009 6:46 PM
To: [email protected]
Subject: Re: Self join problem
Hi Namit,
Thanks for your suggestion.
I tried changing the query as you suggested by moving the m1.dt =
m2.dt predicate into the on clause. That increased the number of
reduce tasks to 2, so now there are two processes running on two
nodes at 100% CPU, each consuming a lot of memory. Is there a reason
why Hive doesn't spawn more reduce tasks for this query?
On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <[email protected]>
wrote:
Put the join condition in the on clause:
insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where m1.id <> m2.id and m1.id < m2.id
group by m1.id, m2.id, m1.dt;
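One side observation (mine, not part of the original suggestion): since m1.id < m2.id already excludes rows where the ids are equal, the m1.id <> m2.id predicate is redundant and the where clause can be simplified:

```sql
-- m1.id < m2.id already guarantees m1.id <> m2.id,
-- so the strict inequality alone is sufficient.
insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where m1.id < m2.id
group by m1.id, m2.id, m1.dt;
```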
From: Defenestrator [mailto:[email protected]]
Sent: Tuesday, November 03, 2009 4:44 PM
To: [email protected]
Subject: Self join problem
Hello,
I'm trying to run the following query, where m1 and m2 contain the
same data (>29M rows), on a 3-node Hadoop cluster. I'm essentially
trying to do a self join. It ends up running 269 map tasks and 1
reduce task. The map tasks complete, but the reduce task runs as a
single process on one of the Hadoop nodes at 100% CPU utilization,
slowly increasing its memory consumption. The reduce task never gets
beyond 82% complete, even after running for a day.
I am running 0.5.0 built from this morning's trunk.
insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2
where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
group by m1.id, m2.id, m1.dt;
Any input would be appreciated.