Thanks. I switched to branch 0.4, and the hash join is working, even though it is running much slower than I expected. I will try to figure out the reason.
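As a first step I will probably just diff the plans with and without the hint. A rough sketch of what I intend to run, reusing the mydata/param/result tables from the repro quoted below (what the plan will actually show is only my guess at this point):

explain
insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

-- With the MAPJOIN hint I expect a single map-only stage in which param is held
-- in an in-memory hash table and mydata is streamed through; without the hint
-- there should be a reduce-side join stage to compare task times against.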
PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Fri, Sep 11, 2009 at 3:13 PM, Namit Jain <[email protected]> wrote:
> Your join keys are different – in that case, we don't merge currently.

On Fri, Sep 11, 2009 at 3:10 PM, Sudipto Das <[email protected]> wrote:
> I checked join26.q. That wasn't what I meant. My query looks something like this:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate, (r.rate - c.cparam*m.mparam)
> from mparam m JOIN data r ON (m.mid = r.mid)
> JOIN cparam c ON (c.cid = r.cid);
>
> mparam and cparam are two inner tables, which are small enough to be loaded into memory together, whereas data is a really huge table, so I want to avoid the cost of making two passes over it. So I want a plan where mparam and cparam are hashed in memory, and data is then streamed through.

On Fri, Sep 11, 2009 at 2:22 PM, Namit Jain <[email protected]> wrote:
> That is not true – look at the unit test join26.q.

On Fri, Sep 11, 2009 at 2:07 PM, Sudipto Das <[email protected]> wrote:
> But this creates a plan where each join is performed in its own map-only MR job, which is as good as specifying one query with MAPJOIN(x) followed by another query with MAPJOIN(y) on the result of the previous join. Is there a way to make it pick just one MR job, where both small inner tables are put in memory and only one pass is made over the data?

On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <[email protected]> wrote:
> Yes, you can specify the list of tables in the hint:
>
> MAPJOIN(x,y,z)

On Fri, Sep 11, 2009 at 1:17 PM, Sudipto Das <[email protected]> wrote:
> Is there any performance difference between 0.4 and trunk? I can then temporarily switch to 0.4 while the problem is being fixed.
>
> Also, can I specify two tables in the MAPJOIN hint when doing a 3-way join, where the two inner tables can fit in memory? I tried some intuitive alternatives, but they did not work. The only option I have now is to use 2 separate JOINs. Since this is a map-only join and the user knows that the inner table(s) will fit in memory, allowing multiple tables would let the JOIN be done in a single map-only job, and would also allow Combiners and Reducers to be added if the query does some aggregation on the JOINed result.
>
> Thanks
> Sudipto
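For the archives, the 2 separate JOINs I fall back to above would look roughly like the following. The intermediate table name tmp_mid_join and its column types are made up for illustration; I have not benchmarked this version:

create table tmp_mid_join (mid INT, cid INT, mparam DOUBLE, rate DOUBLE)
stored as SEQUENCEFILE;

-- First pass: map-join the small mparam table against the big data table.
insert overwrite table tmp_mid_join
select /*+ MAPJOIN(m)*/ r.mid, r.cid, m.mparam, r.rate
from mparam m JOIN data r ON (m.mid = r.mid);

-- Second pass: map-join cparam against the intermediate table, which is about
-- as large as data, so the big table is effectively scanned twice.
insert overwrite table join_result
select /*+ MAPJOIN(c)*/ c.cid, t.mid, c.cparam, t.mparam, t.rate, (t.rate - c.cparam*t.mparam)
from cparam c JOIN tmp_mid_join t ON (c.cid = t.cid);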
On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <[email protected]> wrote:
> This is some problem in trunk – it runs fine in 0.4 – I will take a look.

On Thu, Sep 10, 2009 at 6:16 PM, Sudipto Das <[email protected]> wrote:
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to recreate the error:
>
> create table mydata (mid INT, cid INT, rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT, rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto

On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <[email protected]> wrote:
> The data you have sent has 4 columns –
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> Which is the join column?

On Thu, Sep 10, 2009 at 3:26 PM, Sudipto Das <[email protected]> wrote:
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to the directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3 and Hive trunk r812721.
>
> Thanks
> Sudipto

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <[email protected]> wrote:
> If possible, can you send the entire dataset?

On Wed, Sep 9, 2009 at 7:10 PM, Sudipto Das <[email protected]> wrote:
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set, but as soon as I move to the entire data set the error comes back. I tried it with another table in my data set, but the error persists; it only vanishes for the subset. I can't think of a good way of creating a test data set; I was just taking the first 100-1000 rows.
>
> I am attaching all the queries and the explain. Let me know if it helps.
>
> Thanks
> Sudipto
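Side note while I keep digging into why only the full data set triggered this: one cheap thing I still want to rule out (purely a guess, since the NPE was thrown from CommonJoinOperator, see the stack trace further down) is whether some rows end up with a NULL join key after the staging insert:

select count(1) from mydata where mid is null;
select count(1) from param where mid is null;

If either count is non-zero, I will filter those rows out and retry the map join.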
On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <[email protected]> wrote:
> Can you provide a small testcase?

On Wed, Sep 9, 2009 at 2:20 PM, Sudipto Das <[email protected]> wrote:
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns. I got this exception:
>
> java.lang.RuntimeException
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>         ... 3 more
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082, updated on 09/03.
>
> Thanks
> Sudipto

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <[email protected]> wrote:
> You can specify it as a hint in the select list:
>
> select /*+ MAPJOIN(b) */ … from T a JOIN T2 b on …
>
> In the example above, T2 is the small table which can be cached in memory.

On Wed, Sep 9, 2009 at 2:01 PM, Sudipto Das <[email protected]> wrote:
> Hi,
>
> I am new to Hive, so pardon me if this is something very obvious which I might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash-based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive to instruct it to perform the hash-based join?
>
> Thanks
>
> Sudipto
