I was able to increase the number of reduce jobs manually to 32.  However,
it finishes 28 of them and the other 4 have the same behavior of using 100%
CPU and consuming a lot of memory.  I'm suspecting that it might be an issue
with the reduce job itself - is there a way to figure out what these jobs
are doing exactly?

Thanks.

On Tue, Nov 3, 2009 at 6:53 PM, Namit Jain <[email protected]> wrote:

>  The number of reducers is inferred from the input data size. But, you
> can always override it by setting mapred.reduce.tasks
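> For example, the setting can be issued in the Hive session before the
> query runs (the value 32 here is just illustrative, not a recommendation):
>
> set mapred.reduce.tasks=32;
> insert overwrite table foo1 select ... ;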
>
>
>
>
>
>
>
> *From:* Defenestrator [mailto:[email protected]]
> *Sent:* Tuesday, November 03, 2009 6:46 PM
>
> *To:* [email protected]
> *Subject:* Re: Self join problem
>
>
>
> Hi Namit,
>
>
>
> Thanks for your suggestion.
>
>
>
> I tried changing the query as you had suggested by moving the m1.dt = m2.dt
> to the on clause.  It increased the number of reduce jobs to 2.  So now
> there are two processes running on two nodes at 100% consuming a lot of
> memory.  Is there a reason why Hive doesn't spawn more reduce jobs for this
> query?
>
>
>
> On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <[email protected]> wrote:
>
> Get the join condition in the on condition:
>
>
>
> insert overwrite table foo1
>
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
>
> from m1 join m2 on m1.dt = m2.dt where m1.id <> m2.id and m1.id < m2.id
> group by m1.id, m2.id, m1.dt;
>
>
>
>
>
>
>
> *From:* Defenestrator [mailto:[email protected]]
> *Sent:* Tuesday, November 03, 2009 4:44 PM
> *To:* [email protected]
> *Subject:* Self join problem
>
>
>
> Hello,
>
>
>
> I'm trying to run the following query where m1 and m2 have the same data
> (>29M rows) on a 3-node hadoop cluster.  I'm essentially trying to do a self
> join.  It ends up running 269 map jobs and 1 reduce job.  The map jobs
> complete, but the reduce job runs as a single process on one of the hadoop
> nodes at 100% CPU utilization and slowly increases in memory
> consumption.  The reduce job never goes beyond 82% complete despite letting
> it run for a day.
>
>
>
> I am running on 0.5.0 based on this morning's trunk.
>
>
>
> insert overwrite table foo1
>
> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
>
> from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
> group by m1.id, m2.id, m1.dt;
>
>
>
> Any input would be appreciated.
>
>
>
