Re: Review Request 29281: HIVE-8640 : Support hints of SMBJoin [Spark Branch]

Szehon Ho Fri, 19 Dec 2014 18:04:01 -0800


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > Patch looks good and clean. However, I have two high-level questions:
> > 1. what's a SMB hint? For map-join, hint is /*+ MAPJOIN(X)*/. It doesn't 
> > seem even close for SMB hint.
> > 2. from the code, it seems we only try to convert a map join to bucket map 
> > join and further to SMB join. But does it make sense (out of the scope of 
> > this JIRA) that we can convert a join to bucket map join even if the join 
> > cannot be converted to map join?


Thanks Xuefu for looking at this!  Answers inline.

1.  They reuse the same syntax.  For example in that scenario (x) is the 'small 
table' of the SMB join.  It is the one that is loaded into each mapper.
2.  Bucket join is an advanced case of mapjoin.  In other words, a bucket join 
is always better.  And I think bucket join in terms of the operator-tree is 
just like mapjoin but with some metadata about buckets as far as I understand 
so the progression makes sense.


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > itests/src/test/resources/testconfiguration.properties, line 545
> > <https://reviews.apache.org/r/29281/diff/1/?file=797796#file797796line545>
> >
> >     I think some of the "wrong" ordering is caused by different 
> > understanding of ordering. For instance, mapjoin1.q -> mapjoin10.q -> 
> > mapjoin11.q -> mapjoin12.q -> mapjoin2.q, or  mapjoin1.q -> mapjoin2.q -> 
> > mapjoin10.q -> mapjoin11.q -> mapjoin12.q. Since last reordering, we only 
> > added a few tests manually, which wouldn't have caused so much disrruption.

I ran Brock's script provided in the comments of testconfiguration.properties 
that ensures the order.  It is based on unix sort.  I thought it is the one we 
agreed to use?


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java,
> >  line 39
> > <https://reviews.apache.org/r/29281/diff/1/?file=797799#file797799line39>
> >
> >     The rule matches only MapJoinOperator, is it possible for nd a 
> > SMBMapJoinOperator?

It's a weird check that I copied from the mapreduce side.  It should not happen 
in our use-case (or even their's).  I dont think it hurts though.  Should I get 
rid of it?


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java, line 
> > 129
> > <https://reviews.apache.org/r/29281/diff/1/?file=797801#file797801line129>
> >
> >     Just curious. The previous rule might change JoinOperator to 
> > MapJoinOperator during the walking. is this rule able to process these 
> > converted MapJoinOperators in the same walk?

Only one rule runs per operator, so the output of that wont get fed into this.  
It is based on the operator node at the time of traversal.


- Szehon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29281/#review65714
-----------------------------------------------------------


On Dec. 20, 2014, 12:01 a.m., Szehon Ho wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29281/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2014, 12:01 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8640
>     https://issues.apache.org/jira/browse/HIVE-8640
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This change is on the same principle as the refactoring of HIVE-8639.  The 
> goal is to move as much of the join optimization as possible to the same 
> traversal, and in fact the same process(joinOp) method, to simplify the logic 
> and also for compiler performance.
> 
> Whereas it is too hard to bring SparkMapJoinProcessor (for mapjoin hints) 
> into the same level due to the way it was written (see HIVE-8911), it is 
> possible to bring Bucket join and SMB join hints to the same level.  This 
> change introduces a parallel processor called 'SparkJoinHintOptimizer', which 
> takes a mapjoin already converted by SparkMapJoinProcessor as input and 
> converts it to Bucket/SMB join accordingly.  It runs alongside 
> 'SparkJoinOptimizer' which takes a common join operator and handles the 
> auto-conversion to mapjoin/bucketJoin/SMBJoin.
> 
> The one difference between mapjoin/bucketJoin vs SMB as Chao found was that 
> while Spark mapjoins expect RS for small-table branches in 
> mapjoin/bucketJoin, this is not expected for SMB join.  So I added a class 
> SparkSMBHintJoinOptimizer that first removes this before re-using the rest of 
> the existing code.
> 
> Another issue was found in NonBlockingOpDeDupProc that corrupts 
> 'mapJoinContext' data structure in the parse context.  A fix is offered in 
> HIVE-9117 and that should be committed to trunk and merged first, but it is 
> included here for reference.
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties fd732c1 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java 
> 5e0959a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinHintOptimizer.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java
>  6a47513 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
> 5227d92 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin9.q.out d769ebe 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_1.q.out 8d0527e 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_10.q.out 2df87cf 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_11.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_12.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_13.q.out 5637206 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_14.q.out 3aed084 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_15.q.out 6ed680d 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_16.q.out a4fd7c3 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 6293450 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_2.q.out 1cf144b 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_3.q.out 6b44d2c 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_4.q.out d07d65a 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_5.q.out 607b1f0 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_6.q.out 30746ff 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_7.q.out c48ed6d 
> 
> Diff: https://reviews.apache.org/r/29281/diff/
> 
> 
> Testing
> -------
> 
> Re-enabled all the smb_mapjoin.* tests.
> 
> I saw that a lot of the tests are again not alphabetized, so re-ran the 
> script to alphabeticize them.  As part of that, realized that some tests like 
> 'bucket_map_join_spark.*' and 'join_empty' were missing proper comma 
> deliminters from the next test and probably not ran.  Also fixed the 
> windowing.q which is the last test.  This is all unrelated, but I am not sure 
> if they will trigger additional test failures if these were unintentionally 
> disabled.
> 
> 
> Thanks,
> 
> Szehon Ho
> 
>

Re: Review Request 29281: HIVE-8640 : Support hints of SMBJoin [Spark Branch]

Reply via email to