Hi Namit, The join column is the first column of the data. Here are the queries to recreate the error.
create table mydata (mid INT, cid INT, rating INT) stored as SEQUENCEFILE; create external table mydata_stg(mid INT, cid INT, rating INT, date1 STRING) row format delimited fields terminated by '\054' lines terminated by '\012' stored as textfile location '/user/sudipto/data/hive/text1'; from mydata_stg stg insert overwrite table mydata select stg.mid, stg.cid, stg.rating; create table param(mid INT, param DOUBLE) stored as SEQUENCEFILE; insert overwrite table param select mid, pow(avg(rating), 0.5) from mydata group by mid; create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE) stored as SEQUENCEFILE; insert overwrite table result select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating from mydata r JOIN param m ON (r.mid = m.mid); Thanks Sudipto PhD Candidate CS @ UCSB Santa Barbara, CA 93106, USA http://www.cs.ucsb.edu/~sudipto On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <[email protected]> wrote: > The data you have sent has 4 columns – > > > > insert overwrite table join_result > select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating > from data r JOIN param m ON (r.mid = m.mid); > > > > Which is the join column - > > > > > > *From:* Sudipto Das [mailto:[email protected]] > *Sent:* Thursday, September 10, 2009 3:26 PM > > *To:* [email protected] > *Subject:* Re: Directing Hive to perform Hash Join for small inner tables > > > > Hi Namit, > > I was able to create a smaller data set to replicate the error. Please see > the attached file. Put the file in HDFS, and run the queries (make sure you > update the path for the external table to point to directory in HDFS that > contains this file). I hope you can see the same error. I am using Hadoop > 0.18.3, and Hive trunk r812721. > > Thanks > Sudipto > > PhD Candidate > CS @ UCSB > Santa Barbara, CA 93106, USA > http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto> > > On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <[email protected]> wrote: > > If possible, can you send the entire dataset ? > > > > > > *From:* Sudipto Das [mailto:[email protected]] > *Sent:* Wednesday, September 09, 2009 7:10 PM > > > *To:* [email protected] > *Subject:* Re: Directing Hive to perform Hash Join for small inner tables > > > > Hi, > > This seems to be one of those sneaky bugs... the query works for the > smaller sets of data which I tried creating from the original data set. But > as soon as I move to the entire data set, the error comes back. I tried it > with another table in my data set, but the error persists. It only vanishes > for the subset.. I can't think of a good way of creating a test data set. I > was just taking the first 100 - 1000 rows.. > > I am attaching all the queries and the explain. Let me know if it helps: > > Thanks > Sudipto > > PhD Candidate > CS @ UCSB > Santa Barbara, CA 93106, USA > http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto> > > On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <[email protected]> wrote: > > Can you provide a small testcase ? > > > > > > *From:* Sudipto Das [mailto:[email protected]] > *Sent:* Wednesday, September 09, 2009 2:20 PM > > > *To:* [email protected] > > *Subject:* Re: Directing Hive to perform Hash Join for small inner tables > > > > Hi, > > Thanks for the quick response. I tried the query: > > insert overwrite table join_result > select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating > from data r JOIN param m ON (r.mid = m.mid); > > param has only 17k rows with 2 columns. > > I got this exception > > java.lang.RuntimeException > > > > at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182) > > > > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) > > > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) > > > > > > > > > > at > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198) > > > > Caused by: java.lang.NullPointerException > > > > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283) > > > > > > > > > > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530) > > > > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519) > > > > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519) > > > > > > > > > > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560) > > > > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299) > > > > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374) > > > > > > > > > > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580) > > > > at > org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42) > > > > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374) > > > > > > > > > > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580) > > > > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320) > > > > at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165) > > > > > > > > > > ... 3 more > > > Additionally, the query compiled into two MR jobs. The 2nd one didn't start > because the first failed, but I couldn't reason about the 2nd job. > > I am using Hive trunk, revision 811082 updated on 09/03. > > Thanks > Sudipto > > PhD Candidate > CS @ UCSB > Santa Barbara, CA 93106, USA > http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto> > > On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <[email protected]> wrote: > > You can specify it as a hint in the select list: > > > > > > select /*+ MAPJOIN(b) */ … from T a JOIN T2 b on … > > > > > > In the example above, T2 is the small table which can be cached in memory > > > > > > > > > > *From:* [email protected] [mailto:[email protected]] *On Behalf Of > *Sudipto > Das > *Sent:* Wednesday, September 09, 2009 2:01 PM > *To:* [email protected] > *Subject:* Directing Hive to perform Hash Join for small inner tables > > > > Hi, > > I am new to hive so pardon me if this is something very obvious which I > might have missed in the documentation. > > I have an application where I am joining a small inner table with a really > large outer table. The inner table is small enough to fit into memory at > each mapper. In such a case, putting the inner table into an in-memory hash > table and performing a hash based join is much more efficient than > performing the sort-merge join which the JOIN operator selects. Is there a > way in Hive where I can instruct it perform the hash based join? > > Thanks > > Sudipto > > PhD Candidate > CS @ UCSB > Santa Barbara, CA 93106, USA > http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto> > > > > > > >
