Get the join condition in the on condition: insert overwrite table foo1 select m1.id<http://m1.id> as id_1, m2.id<http://m2.id> as id_2, count(1), m1.dt from m1 join m2 on m1.dt=m2.dt where m1.id<http://m1.id> <> m2.id<http://m2.id> and m1.id<http://m1.id> < m2.id<http://m2.id> group by m1.id<http://m1.id>, m2.id<http://m2.id>, m1.dt;
From: Defenestrator [mailto:[email protected]] Sent: Tuesday, November 03, 2009 4:44 PM To: [email protected] Subject: Self join problem Hello, I'm trying to run the following query where m1 and m2 have the same data (>29M rows) on a 3-node hadoop cluster. I'm essentially trying to do a self join. It ends up running 269 map jobs and 1 reduce job. The map jobs complete but the reduce job just runs on one process on one of the hadoop nodes at 100% cpu utilization and just slowly increases in memory consumption. The reduce job never goes beyond 82% complete despite letting it run for a day. I am running on 0.5.0 based on this morning's trunk. insert overwrite table foo1 select m1.id<http://m1.id> as id_1, m2.id<http://m2.id> as id_2, count(1), m1.dt from m1 join m2 where m1.id<http://m1.id> <> m2.id<http://m2.id> and m1.id<http://m1.id> < m2.id<http://m2.id> and m1.dt = m2.dt group by m1.id<http://m1.id>, m2.id<http://m2.id>, m1.dt; Any input would be appreciated.
