Re: Review Request 29281: HIVE-8640 : Support hints of SMBJoin [Spark Branch]

Xuefu Zhang Fri, 19 Dec 2014 19:08:39 -0800


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > Patch looks good and clean. However, I have two high-level questions:
> > 1. what's a SMB hint? For map-join, hint is /*+ MAPJOIN(X)*/. It doesn't 
> > seem even close for SMB hint.
> > 2. from the code, it seems we only try to convert a map join to bucket map 
> > join and further to SMB join. But does it make sense (out of the scope of 
> > this JIRA) that we can convert a join to bucket map join even if the join 
> > cannot be converted to map join?
> 
> Szehon Ho wrote:
>     Thanks Xuefu for looking at this!  Answers inline.
>     
>     1.  They reuse the same syntax.  For example in that scenario (x) is the 
> 'small table' of the SMB join.  It is the one that is loaded into each mapper.
>     2.  Bucket join is an advanced case of mapjoin.  In other words, a bucket 
> join is always better.  And I think bucket join in terms of the operator-tree 
> is just like mapjoin but with some metadata about buckets as far as I 
> understand so the progression makes sense.


Regarding map join -> bucket map join, I was thinking there might be a case 
where a bucket table cannot be converted to map join (maybe due to memory 
requirement), but can be converted bucket map join (because of small memory 
print of a bucket instead of the whole table.) We can discuss more about this 
later, as it's not in the scope of this JIRA.


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > itests/src/test/resources/testconfiguration.properties, line 545
> > <https://reviews.apache.org/r/29281/diff/1/?file=797796#file797796line545>
> >
> >     I think some of the "wrong" ordering is caused by different 
> > understanding of ordering. For instance, mapjoin1.q -> mapjoin10.q -> 
> > mapjoin11.q -> mapjoin12.q -> mapjoin2.q, or  mapjoin1.q -> mapjoin2.q -> 
> > mapjoin10.q -> mapjoin11.q -> mapjoin12.q. Since last reordering, we only 
> > added a few tests manually, which wouldn't have caused so much disrruption.
> 
> Szehon Ho wrote:
>     I ran Brock's script provided in the comments of 
> testconfiguration.properties that ensures the order.  It is based on unix 
> sort.  I thought it is the one we agreed to use?

Well, I think Jimmy used a slightly modified version of script to order the 
list. He found Brock's script had certain issues. Did you see anything obviouly 
wrong besides mapjoin1 -> mapjoin2 or mapjoin1 -> majoin10? It's up to you 
either fix individual lines or reorder globally.


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java,
> >  line 39
> > <https://reviews.apache.org/r/29281/diff/1/?file=797799#file797799line39>
> >
> >     The rule matches only MapJoinOperator, is it possible for nd a 
> > SMBMapJoinOperator?
> 
> Szehon Ho wrote:
>     It's a weird check that I copied from the mapreduce side.  It should not 
> happen in our use-case (or even their's).  I dont think it hurts though.  
> Should I get rid of it?

The check itself doesn't matter, but it causes confusion in understanding. If 
you need to update the patch, let's remove it.


> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java, line 
> > 129
> > <https://reviews.apache.org/r/29281/diff/1/?file=797801#file797801line129>
> >
> >     Just curious. The previous rule might change JoinOperator to 
> > MapJoinOperator during the walking. is this rule able to process these 
> > converted MapJoinOperators in the same walk?
> 
> Szehon Ho wrote:
>     Only one rule runs per operator, so the output of that wont get fed into 
> this.  It is based on the operator node at the time of traversal.

Got it. Thanks.


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29281/#review65714
-----------------------------------------------------------


On Dec. 20, 2014, 12:01 a.m., Szehon Ho wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29281/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2014, 12:01 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8640
>     https://issues.apache.org/jira/browse/HIVE-8640
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This change is on the same principle as the refactoring of HIVE-8639.  The 
> goal is to move as much of the join optimization as possible to the same 
> traversal, and in fact the same process(joinOp) method, to simplify the logic 
> and also for compiler performance.
> 
> Whereas it is too hard to bring SparkMapJoinProcessor (for mapjoin hints) 
> into the same level due to the way it was written (see HIVE-8911), it is 
> possible to bring Bucket join and SMB join hints to the same level.  This 
> change introduces a parallel processor called 'SparkJoinHintOptimizer', which 
> takes a mapjoin already converted by SparkMapJoinProcessor as input and 
> converts it to Bucket/SMB join accordingly.  It runs alongside 
> 'SparkJoinOptimizer' which takes a common join operator and handles the 
> auto-conversion to mapjoin/bucketJoin/SMBJoin.
> 
> The one difference between mapjoin/bucketJoin vs SMB as Chao found was that 
> while Spark mapjoins expect RS for small-table branches in 
> mapjoin/bucketJoin, this is not expected for SMB join.  So I added a class 
> SparkSMBHintJoinOptimizer that first removes this before re-using the rest of 
> the existing code.
> 
> Another issue was found in NonBlockingOpDeDupProc that corrupts 
> 'mapJoinContext' data structure in the parse context.  A fix is offered in 
> HIVE-9117 and that should be committed to trunk and merged first, but it is 
> included here for reference.
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties fd732c1 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java 
> 5e0959a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinHintOptimizer.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java
>  6a47513 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
> 5227d92 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin9.q.out d769ebe 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_1.q.out 8d0527e 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_10.q.out 2df87cf 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_11.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_12.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_13.q.out 5637206 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_14.q.out 3aed084 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_15.q.out 6ed680d 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_16.q.out a4fd7c3 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 6293450 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_2.q.out 1cf144b 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_3.q.out 6b44d2c 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_4.q.out d07d65a 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_5.q.out 607b1f0 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_6.q.out 30746ff 
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_7.q.out c48ed6d 
> 
> Diff: https://reviews.apache.org/r/29281/diff/
> 
> 
> Testing
> -------
> 
> Re-enabled all the smb_mapjoin.* tests.
> 
> I saw that a lot of the tests are again not alphabetized, so re-ran the 
> script to alphabeticize them.  As part of that, realized that some tests like 
> 'bucket_map_join_spark.*' and 'join_empty' were missing proper comma 
> deliminters from the next test and probably not ran.  Also fixed the 
> windowing.q which is the last test.  This is all unrelated, but I am not sure 
> if they will trigger additional test failures if these were unintentionally 
> disabled.
> 
> 
> Thanks,
> 
> Szehon Ho
> 
>

Re: Review Request 29281: HIVE-8640 : Support hints of SMBJoin [Spark Branch]

Reply via email to