> On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote: > > Patch looks good and clean. However, I have two high-level questions: > > 1. what's a SMB hint? For map-join, hint is /*+ MAPJOIN(X)*/. It doesn't > > seem even close for SMB hint. > > 2. from the code, it seems we only try to convert a map join to bucket map > > join and further to SMB join. But does it make sense (out of the scope of > > this JIRA) that we can convert a join to bucket map join even if the join > > cannot be converted to map join? > > Szehon Ho wrote: > Thanks Xuefu for looking at this! Answers inline. > > 1. They reuse the same syntax. For example in that scenario (x) is the > 'small table' of the SMB join. It is the one that is loaded into each mapper. > 2. Bucket join is an advanced case of mapjoin. In other words, a bucket > join is always better. And I think bucket join in terms of the operator-tree > is just like mapjoin but with some metadata about buckets as far as I > understand so the progression makes sense.
Regarding map join -> bucket map join, I was thinking there might be a case where a bucket table cannot be converted to map join (maybe due to memory requirement), but can be converted bucket map join (because of small memory print of a bucket instead of the whole table.) We can discuss more about this later, as it's not in the scope of this JIRA. > On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote: > > itests/src/test/resources/testconfiguration.properties, line 545 > > <https://reviews.apache.org/r/29281/diff/1/?file=797796#file797796line545> > > > > I think some of the "wrong" ordering is caused by different > > understanding of ordering. For instance, mapjoin1.q -> mapjoin10.q -> > > mapjoin11.q -> mapjoin12.q -> mapjoin2.q, or mapjoin1.q -> mapjoin2.q -> > > mapjoin10.q -> mapjoin11.q -> mapjoin12.q. Since last reordering, we only > > added a few tests manually, which wouldn't have caused so much disrruption. > > Szehon Ho wrote: > I ran Brock's script provided in the comments of > testconfiguration.properties that ensures the order. It is based on unix > sort. I thought it is the one we agreed to use? Well, I think Jimmy used a slightly modified version of script to order the list. He found Brock's script had certain issues. Did you see anything obviouly wrong besides mapjoin1 -> mapjoin2 or mapjoin1 -> majoin10? It's up to you either fix individual lines or reorder globally. > On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote: > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java, > > line 39 > > <https://reviews.apache.org/r/29281/diff/1/?file=797799#file797799line39> > > > > The rule matches only MapJoinOperator, is it possible for nd a > > SMBMapJoinOperator? > > Szehon Ho wrote: > It's a weird check that I copied from the mapreduce side. It should not > happen in our use-case (or even their's). I dont think it hurts though. > Should I get rid of it? The check itself doesn't matter, but it causes confusion in understanding. If you need to update the patch, let's remove it. > On Dec. 20, 2014, 1:36 a.m., Xuefu Zhang wrote: > > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java, line > > 129 > > <https://reviews.apache.org/r/29281/diff/1/?file=797801#file797801line129> > > > > Just curious. The previous rule might change JoinOperator to > > MapJoinOperator during the walking. is this rule able to process these > > converted MapJoinOperators in the same walk? > > Szehon Ho wrote: > Only one rule runs per operator, so the output of that wont get fed into > this. It is based on the operator node at the time of traversal. Got it. Thanks. - Xuefu ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29281/#review65714 ----------------------------------------------------------- On Dec. 20, 2014, 12:01 a.m., Szehon Ho wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29281/ > ----------------------------------------------------------- > > (Updated Dec. 20, 2014, 12:01 a.m.) > > > Review request for hive. > > > Bugs: HIVE-8640 > https://issues.apache.org/jira/browse/HIVE-8640 > > > Repository: hive-git > > > Description > ------- > > This change is on the same principle as the refactoring of HIVE-8639. The > goal is to move as much of the join optimization as possible to the same > traversal, and in fact the same process(joinOp) method, to simplify the logic > and also for compiler performance. > > Whereas it is too hard to bring SparkMapJoinProcessor (for mapjoin hints) > into the same level due to the way it was written (see HIVE-8911), it is > possible to bring Bucket join and SMB join hints to the same level. This > change introduces a parallel processor called 'SparkJoinHintOptimizer', which > takes a mapjoin already converted by SparkMapJoinProcessor as input and > converts it to Bucket/SMB join accordingly. It runs alongside > 'SparkJoinOptimizer' which takes a common join operator and handles the > auto-conversion to mapjoin/bucketJoin/SMBJoin. > > The one difference between mapjoin/bucketJoin vs SMB as Chao found was that > while Spark mapjoins expect RS for small-table branches in > mapjoin/bucketJoin, this is not expected for SMB join. So I added a class > SparkSMBHintJoinOptimizer that first removes this before re-using the rest of > the existing code. > > Another issue was found in NonBlockingOpDeDupProc that corrupts > 'mapJoinContext' data structure in the parse context. A fix is offered in > HIVE-9117 and that should be committed to trunk and merged first, but it is > included here for reference. > > > Diffs > ----- > > itests/src/test/resources/testconfiguration.properties fd732c1 > ql/src/java/org/apache/hadoop/hive/ql/optimizer/NonBlockingOpDeDupProc.java > 5e0959a > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinHintOptimizer.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSMBJoinHintOptimizer.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java > 6a47513 > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java > 5227d92 > ql/src/test/results/clientpositive/spark/smb_mapjoin9.q.out d769ebe > ql/src/test/results/clientpositive/spark/smb_mapjoin_1.q.out 8d0527e > ql/src/test/results/clientpositive/spark/smb_mapjoin_10.q.out 2df87cf > ql/src/test/results/clientpositive/spark/smb_mapjoin_11.q.out PRE-CREATION > ql/src/test/results/clientpositive/spark/smb_mapjoin_12.q.out PRE-CREATION > ql/src/test/results/clientpositive/spark/smb_mapjoin_13.q.out 5637206 > ql/src/test/results/clientpositive/spark/smb_mapjoin_14.q.out 3aed084 > ql/src/test/results/clientpositive/spark/smb_mapjoin_15.q.out 6ed680d > ql/src/test/results/clientpositive/spark/smb_mapjoin_16.q.out a4fd7c3 > ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 6293450 > ql/src/test/results/clientpositive/spark/smb_mapjoin_2.q.out 1cf144b > ql/src/test/results/clientpositive/spark/smb_mapjoin_3.q.out 6b44d2c > ql/src/test/results/clientpositive/spark/smb_mapjoin_4.q.out d07d65a > ql/src/test/results/clientpositive/spark/smb_mapjoin_5.q.out 607b1f0 > ql/src/test/results/clientpositive/spark/smb_mapjoin_6.q.out 30746ff > ql/src/test/results/clientpositive/spark/smb_mapjoin_7.q.out c48ed6d > > Diff: https://reviews.apache.org/r/29281/diff/ > > > Testing > ------- > > Re-enabled all the smb_mapjoin.* tests. > > I saw that a lot of the tests are again not alphabetized, so re-ran the > script to alphabeticize them. As part of that, realized that some tests like > 'bucket_map_join_spark.*' and 'join_empty' were missing proper comma > deliminters from the next test and probably not ran. Also fixed the > windowing.q which is the last test. This is all unrelated, but I am not sure > if they will trigger additional test failures if these were unintentionally > disabled. > > > Thanks, > > Szehon Ho > >
