I would definitely appreciate any insights on this from the list. I reduced the query down to something easily understood, and Hive still demonstrates poor join performance on a three-node Hadoop cluster.
drop table m1;
drop table foo1;
create table m1 (mid int, aid int, dt string);
LOAD DATA LOCAL INPATH 'm1' OVERWRITE INTO TABLE m1;
create table foo1 (aid_1 int, aid_2 int, mid bigint, dt bigint);
set mapred.reduce.tasks=32;
insert overwrite table foo1
select m1.aid as aid_1, m2.aid as aid_2, count(1), m1.dt as dt
from m1 m1 join m1 m2 on m1.aid = m2.aid and m1.dt = m2.dt
group by m1.aid, m2.aid, m1.dt;

Attached is the file I'm using; it only has 100k rows. I've looked at the benchmark (http://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf), and Hive seems to be able to join much bigger data sets. I also tried running the same query on a single-node DBMS on my desktop, and it returns results in less than 3 minutes, while Hive has been running for at least 20 minutes now.
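Since four of the 32 reducers in the thread below get stuck at 100% CPU while the other 28 finish, I suspect a hot join key. Before anything else I'm going to look at the key distribution; this is just a quick sketch against the m1 table from the repro above (assuming ORDER BY ... LIMIT works on this trunk build):

select aid, dt, count(1) as cnt
from m1
group by aid, dt
order by cnt desc
limit 20;

If a handful of (aid, dt) pairs dominate, all of those rows hash to the same few reducers, which would explain the behavior.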
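Separately, since the reduced repro's m1 is only 100k rows, I may also try taking the shuffle out of the join entirely with a map-side join. This is only a sketch, assuming the MAPJOIN hint is available in this trunk build (the hinted alias m2 would be loaded into memory on each mapper):

insert overwrite table foo1
select /*+ MAPJOIN(m2) */ m1.aid as aid_1, m2.aid as aid_2, count(1), m1.dt as dt
from m1 m1 join m1 m2 on m1.aid = m2.aid and m1.dt = m2.dt
group by m1.aid, m2.aid, m1.dt;

That wouldn't help the 29M-row case below if the hashed side doesn't fit in memory, but it would at least confirm whether the join shuffle or the group-by reducer is the bottleneck.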
Thanks.

On Tue, Nov 10, 2009 at 3:53 PM, Ryan LeCompte <[email protected]> wrote:

> Any thoughts on this? I've only had luck by reducing the data on each side
> of the join. Is this something Hive might be able to improve in a future
> release of the query plan optimization?
>
> Thanks,
> Ryan
>
> On Nov 3, 2009, at 10:55 PM, Ryan LeCompte <[email protected]> wrote:
>
> I've had a similar issue with a small cluster. Is there any way that you
> can reduce the size of the data being joined on both sides? If you search
> the forums for "join issue", you will see the thread for my issue and get
> some tips.
>
> Thanks,
> Ryan
>
> On Nov 3, 2009, at 10:45 PM, Defenestrator <[email protected]> wrote:
>
> I was able to increase the number of reduce jobs manually to 32. However,
> it finishes 28 of them, and the other 4 have the same behavior of using
> 100% CPU and consuming a lot of memory. I'm suspecting that it might be an
> issue with the reduce job itself - is there a way to figure out what these
> jobs are doing exactly?
>
> Thanks.
>
> On Tue, Nov 3, 2009 at 6:53 PM, Namit Jain <[email protected]> wrote:
>
>> The number of reducers is inferred from the input data size, but you
>> can always override it by setting mapred.reduce.tasks.
>>
>> From: Defenestrator [mailto:[email protected]]
>> Sent: Tuesday, November 03, 2009 6:46 PM
>> To: [email protected]
>> Subject: Re: Self join problem
>>
>> Hi Namit,
>>
>> Thanks for your suggestion.
>>
>> I tried changing the query as you had suggested by moving the m1.dt =
>> m2.dt to the on clause. It increased the number of reduce jobs to 2. So
>> now there are two processes running on two nodes at 100% CPU, consuming a
>> lot of memory. Is there a reason why Hive doesn't spawn more reduce jobs
>> for this query?
>>
>> On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <[email protected]> wrote:
>>
>> Get the join condition in the on condition:
>>
>> insert overwrite table foo1
>> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
>> from m1 join m2 on m1.dt = m2.dt where m1.id <> m2.id and m1.id < m2.id
>> group by m1.id, m2.id, m1.dt;
>>
>> From: Defenestrator [mailto:[email protected]]
>> Sent: Tuesday, November 03, 2009 4:44 PM
>> To: [email protected]
>> Subject: Self join problem
>>
>> Hello,
>>
>> I'm trying to run the following query, where m1 and m2 have the same data
>> (>29M rows), on a 3-node Hadoop cluster. I'm essentially trying to do a
>> self join. It ends up running 269 map jobs and 1 reduce job. The map jobs
>> complete, but the reduce job just runs as one process on one of the
>> Hadoop nodes at 100% CPU utilization and slowly increases in memory
>> consumption. The reduce job never goes beyond 82% complete despite
>> letting it run for a day.
>>
>> I am running on 0.5.0 based on this morning's trunk.
>>
>> insert overwrite table foo1
>> select m1.id as id_1, m2.id as id_2, count(1), m1.dt
>> from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt
>> group by m1.id, m2.id, m1.dt;
>>
>> Any input would be appreciated.
