Re: Directing Hive to perform Hash Join for small inner tables

Sudipto Das Fri, 11 Sep 2009 13:17:39 -0700

Is there any performance difference between 0.4 and trunk? I can then
temporarily switch to 0.4 while the problem is being fixed.


Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join,
where two tables can fit in memory. I tried some intuitive alternatives but
did not work. The only option I have now is to have 2 separate JOINs. Since
this is a map only join, and the user knows that the inner table(s) will fit
in memory, so allowing multiple tables will result in the JOIN being done in
a single Map only job, and will also allow Combiners and Reducers to be put
in if the query is doing some aggregation on the JOINed result.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <[email protected]> wrote:

>  This is some problem in trunk – it runs fine in 0.4 – I will take a look
>
>
>
> *From:* Sudipto Das [mailto:[email protected]]
> *Sent:* Thursday, September 10, 2009 6:16 PM
>
> *To:* [email protected]
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to
> recreate the error.
>
> create table mydata (mid INT, cid INT,
> rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT,
> rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <[email protected]> wrote:
>
> The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:[email protected]]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
>
> *To:* [email protected]
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <[email protected]> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:[email protected]]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* [email protected]
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <[email protected]> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:[email protected]]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* [email protected]
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
>
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>         at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <[email protected]> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of 
> *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* [email protected]
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>
>
>

Re: Directing Hive to perform Hash Join for small inner tables

Reply via email to