wangyum opened a new pull request #22778:
URL: https://github.com/apache/spark/pull/22778


   ## What changes were proposed in this pull request?
   
   [SPARK-22662](https://issues.apache.org/jira/browse/SPARK-22662) fixed 
failed to prune columns after rewriting predicate subquery.  but infer filters 
is still missing, this pr fix it.
   ```sql
   set spark.sql.autoBroadcastJoinThreshold=-1;
   create table t1 using parquet as select 1 as col1, 2 as col2;
   create table t2 using parquet as select * from t1;
   ```
   Before this patch:
   ```
   spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from 
t2);
   == Physical Plan ==
   SortMergeJoin [col1#3], [col1#5], LeftSemi
   :- *(2) Sort [col1#3 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(col1#3, 200)
   :     +- *(1) FileScan parquet default.t1[col1#3,col2#4] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int,col2:int>
   +- *(4) Sort [col1#5 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col1#5, 200)
         +- *(3) Project [col1#5]
            +- *(3) FileScan parquet default.t2[col1#5] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int>
   ```
   After this patch:
   ```
   spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from 
t2);
   == Physical Plan ==
   SortMergeJoin [col1#6], [col1#11], LeftSemi
   :- *(2) Sort [col1#6 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(col1#6, 200)
   :     +- *(1) Project [col1#6, col2#7]
   :        +- *(1) Filter isnotnull(col1#6)
   :           +- *(1) FileScan parquet default.t1[col1#6,col2#7] Batched: 
true, DataFilters: [isnotnull(col1#6)], Format: Parquet, Location: 
InMemoryFileIndex[file:/user/hive/warehouse/t1], PartitionFilters: [], 
PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int,col2:int>
   +- *(4) Sort [col1#11 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col1#11, 200)
         +- *(3) Project [col1#11]
            +- *(3) Filter isnotnull(col1#11)
               +- *(3) FileScan parquet default.t2[col1#11] Batched: true, 
DataFilters: [isnotnull(col1#11)], Format: Parquet, Location: 
InMemoryFileIndex[file:/user/hive/warehouse/t2], PartitionFilters: [], 
PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int>
   ```
   
   ## How was this patch tested?
   unit tests and benchmark tests
   
   **1. Benchmark test 1**
   
   ```scala
   withTempView("t1", "t2") {
     withTempDir { dir =>
       spark.range(3000000)
         .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as 
c2", "id as c3")
         .coalesce(1)
         .orderBy("c2")
         .write
         .mode("overwrite")
         .option("parquet.block.size", 10485760)
         .parquet(dir.getCanonicalPath)
   
       spark.read.parquet(dir.getCanonicalPath).createTempView("t1")
       spark.read.parquet(dir.getCanonicalPath).createTempView("t2")
   
       Seq("c1", "c2", "c3").foreach { column =>
         val benchmark = new Benchmark(s"join key $column", 10)
         Seq(false, true).foreach { inferFilters =>
           benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { 
_ =>
             withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> 
inferFilters.toString) {
               sql(s"select t1.* from t1 where t1.$column in (select $column 
from t2)").count()
             }
           }
         }
         benchmark.run()
       }
     }
   }
   ```
   
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
   Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
   join key c1:                             Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------
   Is infer filters false                        2005 / 2163          0.0   
200481431.0       1.0X
   Is infer filters true                          190 /  207          0.0    
18962935.7      10.6X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
   Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
   join key c2:                             Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------
   Is infer filters false                        2368 / 2498          0.0   
236803743.1       1.0X
   Is infer filters true                         1234 / 1268          0.0   
123443912.3       1.9X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
   Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
   join key c3:                             Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------
   Is infer filters false                        2754 / 2907          0.0   
275376009.7       1.0X
   Is infer filters true                         2237 / 2255          0.0   
223739457.8       1.2X
   ```
   
   **2. Benchmark test 2**
   
   This change affect TPC-DS: q4, q5, q8, q10, q11, q14a, q14b, q16, q23a, 
q23b, q33, q35, q56, q58, q60, q69, q70, q74, q83, q93, q94, q95, q5a, q10a, 
q14, q35, q35a, q70a, q74. and this is benchmark result(note that: you can 
click SQL name to compare the optimized plan before and after):
   
   SQL | Before Best/Avg Time(ms) | After Best/Avg Time(ms) | Before Avg 
Time(ms) | After Avg Time(ms) | Relative
   -- | -- | -- | -- | -- | --
   
[q35a-v2.7](https://user-images.githubusercontent.com/5399861/47769506-1c162600-dd17-11e8-9d67-3c692aa90f7d.png)
 | 18534 / 21968 | 17030 / 18824 | 21968 | 18824 | 0.8569
   
[q69](https://user-images.githubusercontent.com/5399861/47769509-1fa9ad00-dd17-11e8-9a2d-96005834491e.png)
 | 21560 / 26378 | 21565 / 23928 | 26378 | 23928 | 0.9071
   q35 | 25057 / 27146 | 18702 / 25185 | 27146 | 25185 | 0.9278
   q56 | 31076 / 34614 | 25384 / 32604 | 34614 | 32604 | 0.9419
   q33 | 25762 / 29239 | 25451 / 28012 | 29239 | 28012 | 0.958
   q14-v2.7 | 131016 / 140958 | 131630 / 137203 | 140958 | 137203 | 0.9734
   q4 | 145648 / 147785 | 140788 / 143970 | 147785 | 143970 | 0.9742
   q10a-v2.7 | 19750 / 21240 | 19044 / 20814 | 21240 | 20814 | 0.9799
   q70a-v2.7 | 16184 / 16571 | 16135 / 16279 | 16571 | 16279 | 0.9824
   q11-v2.7 | 50646 / 54571 | 46192 / 53667 | 54571 | 53667 | 0.9834
   q95 | 121299 / 129225 | 121197 / 128010 | 129225 | 128010 | 0.9906
   q16 | 40738 / 42717 | 40133 / 42402 | 42717 | 42402 | 0.9926
   q83 | 25370 / 26521 | 24994 / 26405 | 26521 | 26405 | 0.9956
   q8 | 8837 / 11696 | 8688 / 11670 | 11696 | 11670 | 0.9978
   q14b | 165137 / 169002 | 164642 / 168943 | 169002 | 168943 | 0.9997
   q23a | 143793 / 151616 | 143302 / 151598 | 151616 | 151598 | 0.9999
   q14a-v2.7 | 192375 / 212209 | 194224 / 212386 | 212209 | 212386 | 1.0008
   q23b | 135824 / 145186 | 137020 / 145420 | 145186 | 145420 | 1.0016
   q14a | 210252 / 214839 | 210073 / 215748 | 214839 | 215748 | 1.0042
   q10 | 17649 / 21792 | 18157 / 21951 | 21792 | 21951 | 1.0073
   q58 | 29045 / 31574 | 29685 / 31910 | 31574 | 31910 | 1.0106
   q74 | 41370 / 46157 | 43644 / 46670 | 46157 | 46670 | 1.0111
   q5a-v2.7 | 58671 / 64598 | 59746 / 65335 | 64598 | 65335 | 1.0114
   q60 | 31167 / 35465 | 29486 / 35930 | 35465 | 35930 | 1.0131
   q74-v2.7 | 41394 / 44803 | 42327 / 45914 | 44803 | 45914 | 1.0248
   q11 | 49738 / 54953 | 52977 / 57312 | 54953 | 57312 | 1.0429
   q93 | 72402 / 74155 | 73451 / 77526 | 74155 | 77526 | 1.0455
   q5 | 62540 / 65776 | 66606 / 69293 | 65776 | 69293 | 1.0535
   q35-v2.7 | 17534 / 19987 | 19913 / 21244 | 19987 | 21244 | 1.0629
   
[q94](https://user-images.githubusercontent.com/5399861/47770003-b0cd5380-dd18-11e8-841f-8ad40ae3577a.png)
 | 20375 / 20631 | 23385 / 24338 | 20631 | 24338 | 1.1797
   
[q70](https://user-images.githubusercontent.com/5399861/47769990-a1e6a100-dd18-11e8-9448-7634dfc579d3.png)
 | 17182 / 19610 | 18252 / 23405 | 19610 | 23405 | 1.1935
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to