Hi Namit, thanks for your suggestion.
I tried changing the query as you suggested, moving the m1.dt = m2.dt predicate to the ON clause. It increased the number of reduce jobs to 2, so now there are two processes running on two nodes at 100% CPU, consuming a lot of memory. Is there a reason why Hive doesn't spawn more reduce jobs for this query?

On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <[email protected]> wrote:
> Put the join condition in the ON clause:
>
> insert overwrite table foo1
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
> from m1 join m2 on m1.dt = m2.dt
> where m1.id <> m2.id and m1.id < m2.id
> group by m1.id, m2.id, m1.dt;
>
> *From:* Defenestrator [mailto:[email protected]]
> *Sent:* Tuesday, November 03, 2009 4:44 PM
> *To:* [email protected]
> *Subject:* Self join problem
>
> Hello,
>
> I'm trying to run the following query, where m1 and m2 have the same data
> (>29M rows), on a 3-node Hadoop cluster. I'm essentially trying to do a
> self join. It ends up running 269 map jobs and 1 reduce job. The map jobs
> complete, but the reduce job runs as a single process on one of the Hadoop
> nodes at 100% CPU utilization and slowly increases in memory consumption.
> The reduce job never goes beyond 82% complete despite letting it run for
> a day.
>
> I am running 0.5.0 based on this morning's trunk.
>
> insert overwrite table foo1
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
> from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
> group by m1.id, m2.id, m1.dt;
>
> Any input would be appreciated.
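For reference, here is a sketch of how the reducer count could be raised by hand in the Hive CLI, assuming the standard settings of that era (`hive.exec.reducers.bytes.per.reducer` and `mapred.reduce.tasks`); the numeric values below are purely illustrative:

```sql
-- Hive estimates the reduce-stage parallelism from input size divided by
-- hive.exec.reducers.bytes.per.reducer; lowering it yields more reducers
-- (illustrative value, not a recommendation):
set hive.exec.reducers.bytes.per.reducer=256000000;

-- Alternatively, force a fixed reducer count outright:
set mapred.reduce.tasks=8;

insert overwrite table foo1
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where m1.id <> m2.id and m1.id < m2.id
group by m1.id, m2.id, m1.dt;
```

One caveat worth noting: since the only equality key in the ON clause is dt, all rows sharing the same dt value hash to the same reducer. Forcing more reducers therefore only helps if m1 contains many distinct dt values; with few distinct dates, the extra reducers would sit idle while one or two do all the work.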
