Hi Namit, thanks for your suggestion.
I tried changing the query as you suggested, moving the m1.dt = m2.dt predicate to the ON clause. It increased the number of reduce jobs to 2, so now there are two processes running on two nodes at 100% CPU, consuming a lot of memory. Is there a reason why Hive doesn't spawn more reduce jobs for this query?

On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <[email protected]> wrote:
> Put the join condition in the ON clause:
>
> insert overwrite table foo1
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
> from m1 join m2 on m1.dt = m2.dt
> where m1.id <> m2.id and m1.id < m2.id
> group by m1.id, m2.id, m1.dt;
>
> *From:* Defenestrator [mailto:[email protected]]
> *Sent:* Tuesday, November 03, 2009 4:44 PM
> *To:* [email protected]
> *Subject:* Self join problem
>
> Hello,
>
> I'm trying to run the following query, where m1 and m2 have the same data
> (>29M rows), on a 3-node Hadoop cluster. I'm essentially trying to do a
> self join. It ends up running 269 map jobs and 1 reduce job. The map jobs
> complete, but the reduce job runs as a single process on one of the Hadoop
> nodes at 100% CPU utilization and slowly increases in memory consumption.
> The reduce job never goes beyond 82% complete despite letting it run for
> a day.
>
> I am running 0.5.0 based on this morning's trunk.
>
> insert overwrite table foo1
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
> from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
> group by m1.id, m2.id, m1.dt;
>
> Any input would be appreciated.
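For reference, here is a sketch of how the reducer count could be raised by hand in the Hive CLI, assuming the standard settings of that era (`hive.exec.reducers.bytes.per.reducer` and `mapred.reduce.tasks`); the numeric values below are purely illustrative:

```sql
-- Hive estimates the reduce-stage parallelism from input size divided by
-- hive.exec.reducers.bytes.per.reducer; lowering it yields more reducers
-- (illustrative value, not a recommendation):
set hive.exec.reducers.bytes.per.reducer=256000000;

-- Alternatively, force a fixed reducer count outright:
set mapred.reduce.tasks=8;

insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where m1.id <> m2.id and m1.id < m2.id
group by m1.id, m2.id, m1.dt;
```

One caveat worth noting: since the only equality key in the ON clause is dt, all rows sharing the same dt value hash to the same reducer. Forcing more reducers therefore only helps if m1 contains many distinct dt values; with few distinct dates, the extra reducers would sit idle while one or two do all the work.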
