Self join problem

Defenestrator Tue, 03 Nov 2009 18:31:30 -0800

Hello,

I'm trying to run the following query where m1 and m2 have the same data
(>29M rows) on a 3-node hadoop cluster.  I'm essentially trying to do a self
join.  It ends up running 269 map jobs and 1 reduce job.  The map jobs
complete but the reduce job just runs on one process on one of the hadoop
nodes at 100% cpu utilization and just slowly increases in memory
consumption.  The reduce job never goes beyond 82% complete despite letting
it run for a day.


I am running on 0.5.0 based on this morning's trunk.

insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
group by m1.id, m2.id, m1.dt;

Any input would be appreciated.

Self join problem

Reply via email to