Can you provide a small test case?

From: Sudipto Das [mailto:[email protected]]
Sent: Wednesday, September 09, 2009 2:20 PM
To: [email protected]
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi,

Thanks for the quick response. I tried the query:

    insert overwrite table join_result
    select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
    from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns. I got this exception:

    java.lang.RuntimeException
        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
    Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <[email protected]> wrote:

You can specify it as a hint in the select list:

    select /*+ MAPJOIN(b) */ ... from T a JOIN T2 b on ...

In the example above, T2 is the small table which can be cached in memory.

From: [email protected] [mailto:[email protected]] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: [email protected]
Subject: Directing Hive to perform Hash Join for small inner tables

Hi,

I am new to Hive, so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash-based join is much more efficient than performing the sort-merge join which the JOIN operator selects.

Is there a way in Hive where I can instruct it to perform the hash-based join?

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto
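
For readers unfamiliar with the technique asked about above, here is a minimal Python sketch of an in-memory hash join: the small table is loaded into a hash table once (build phase), and the large table is streamed past it (probe phase). This is an illustration only and not Hive's MapJoinOperator implementation. The function name `hash_join` and the sample rows are hypothetical; only the table and column names (`param`, `data`, `mid`, `param`, `rating`) come from the query in the thread.

```python
from collections import defaultdict


def hash_join(small_rows, large_rows, small_key, large_key):
    """Yield (small_row, large_row) pairs whose join keys are equal.

    Hypothetical helper for illustration; not part of Hive.
    """
    # Build phase: index the small table by its join key, held in memory.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)

    # Probe phase: stream the large table and look up each key.
    # No sorting or shuffling of the large table is required.
    for row in large_rows:
        for match in table.get(row[large_key], ()):
            yield match, row


# Made-up sample rows shaped like the tables in the thread's query.
param = [{"mid": 1, "param": "a"}, {"mid": 2, "param": "b"}]
data = [
    {"mid": 1, "rating": 5},
    {"mid": 3, "rating": 2},
    {"mid": 1, "rating": 4},
]

join_result = [
    (m["mid"], m["param"], r["rating"])
    for m, r in hash_join(param, data, "mid", "mid")
]
print(join_result)  # [(1, 'a', 5), (1, 'a', 4)]
```

The build side costs memory proportional to the small table, which is why the approach only applies when that table fits in memory at each mapper.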
