RE: Directing Hive to perform Hash Join for small inner tables

Namit Jain Wed, 09 Sep 2009 19:13:04 -0700

If possible, can you send the entire dataset ?

From: Sudipto Das [mailto:[email protected]]
Sent: Wednesday, September 09, 2009 7:10 PM
To: [email protected]
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi,

This seems to be one of those sneaky bugs... the query works for the smaller 
sets of data which I tried creating from the original data set. But as soon as 
I move to the entire data set, the error comes back. I tried it with another 
table in my data set, but the error persists. It only vanishes for the subset.. 
I can't think of a good way of creating a test data set. I was just taking the 
first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

Can you provide a small testcase ?

From: Sudipto Das [mailto:[email protected]<mailto:[email protected]>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: [email protected]<mailto:[email protected]>
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException

        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)

        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)

        at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Caused by: java.lang.NullPointerException

        at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)

        at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)

        at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)

        at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)

        at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)

        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)

        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)

        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)

        at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)

        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)

        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)

        at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)

        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)

        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start 
because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:

You can specify it as a hint in the select list:

select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...

In the example above, T2 is the small table which can be cached in memory

From: [email protected]<mailto:[email protected]> 
[mailto:[email protected]<mailto:[email protected]>] On Behalf Of Sudipto 
Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: [email protected]<mailto:[email protected]>
Subject: Directing Hive to perform Hash Join for small inner tables

Hi,

I am new to hive so pardon me if this is something very obvious which I might 
have missed in the documentation.

I have an application where I am joining a small inner table with a really 
large outer table. The inner table is small enough to fit into memory at each 
mapper. In such a case, putting the inner table into an in-memory hash table 
and performing a hash based join is much more efficient than performing the 
sort-merge join which the JOIN operator selects. Is there a way in Hive where I 
can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

RE: Directing Hive to perform Hash Join for small inner tables

Reply via email to