wangyum opened a new pull request #22778: URL: https://github.com/apache/spark/pull/22778
## What changes were proposed in this pull request? [SPARK-22662](https://issues.apache.org/jira/browse/SPARK-22662) fixed failed to prune columns after rewriting predicate subquery. but infer filters is still missing, this pr fix it. ```sql set spark.sql.autoBroadcastJoinThreshold=-1; create table t1 using parquet as select 1 as col1, 2 as col2; create table t2 using parquet as select * from t1; ``` Before this patch: ``` spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from t2); == Physical Plan == SortMergeJoin [col1#3], [col1#5], LeftSemi :- *(2) Sort [col1#3 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(col1#3, 200) : +- *(1) FileScan parquet default.t1[col1#3,col2#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int,col2:int> +- *(4) Sort [col1#5 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#5, 200) +- *(3) Project [col1#5] +- *(3) FileScan parquet default.t2[col1#5] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int> ``` After this patch: ``` spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from t2); == Physical Plan == SortMergeJoin [col1#6], [col1#11], LeftSemi :- *(2) Sort [col1#6 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(col1#6, 200) : +- *(1) Project [col1#6, col2#7] : +- *(1) Filter isnotnull(col1#6) : +- *(1) FileScan parquet default.t1[col1#6,col2#7] Batched: true, DataFilters: [isnotnull(col1#6)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int,col2:int> +- *(4) Sort [col1#11 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#11, 200) +- *(3) Project [col1#11] +- *(3) Filter isnotnull(col1#11) +- *(3) FileScan parquet default.t2[col1#11] Batched: true, DataFilters: [isnotnull(col1#11)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int> ``` ## How was this patch tested? unit tests and benchmark tests **1. Benchmark test 1** ```scala withTempView("t1", "t2") { withTempDir { dir => spark.range(3000000) .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as c2", "id as c3") .coalesce(1) .orderBy("c2") .write .mode("overwrite") .option("parquet.block.size", 10485760) .parquet(dir.getCanonicalPath) spark.read.parquet(dir.getCanonicalPath).createTempView("t1") spark.read.parquet(dir.getCanonicalPath).createTempView("t2") Seq("c1", "c2", "c3").foreach { column => val benchmark = new Benchmark(s"join key $column", 10) Seq(false, true).foreach { inferFilters => benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { _ => withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> inferFilters.toString) { sql(s"select t1.* from t1 where t1.$column in (select $column from t2)").count() } } } benchmark.run() } } } ``` ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c1: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Is infer filters false 2005 / 2163 0.0 200481431.0 1.0X Is infer filters true 190 / 207 0.0 18962935.7 10.6X Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c2: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Is infer filters false 2368 / 2498 0.0 236803743.1 1.0X Is infer filters true 1234 / 1268 0.0 123443912.3 1.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz join key c3: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Is infer filters false 2754 / 2907 0.0 275376009.7 1.0X Is infer filters true 2237 / 2255 0.0 223739457.8 1.2X ``` **2. Benchmark test 2** This change affect TPC-DS: q4, q5, q8, q10, q11, q14a, q14b, q16, q23a, q23b, q33, q35, q56, q58, q60, q69, q70, q74, q83, q93, q94, q95, q5a, q10a, q14, q35, q35a, q70a, q74. and this is benchmark result(note that: you can click SQL name to compare the optimized plan before and after): SQL | Before Best/Avg Time(ms) | After Best/Avg Time(ms) | Before Avg Time(ms) | After Avg Time(ms) | Relative -- | -- | -- | -- | -- | -- [q35a-v2.7](https://user-images.githubusercontent.com/5399861/47769506-1c162600-dd17-11e8-9d67-3c692aa90f7d.png) | 18534 / 21968 | 17030 / 18824 | 21968 | 18824 | 0.8569 [q69](https://user-images.githubusercontent.com/5399861/47769509-1fa9ad00-dd17-11e8-9a2d-96005834491e.png) | 21560 / 26378 | 21565 / 23928 | 26378 | 23928 | 0.9071 q35 | 25057 / 27146 | 18702 / 25185 | 27146 | 25185 | 0.9278 q56 | 31076 / 34614 | 25384 / 32604 | 34614 | 32604 | 0.9419 q33 | 25762 / 29239 | 25451 / 28012 | 29239 | 28012 | 0.958 q14-v2.7 | 131016 / 140958 | 131630 / 137203 | 140958 | 137203 | 0.9734 q4 | 145648 / 147785 | 140788 / 143970 | 147785 | 143970 | 0.9742 q10a-v2.7 | 19750 / 21240 | 19044 / 20814 | 21240 | 20814 | 0.9799 q70a-v2.7 | 16184 / 16571 | 16135 / 16279 | 16571 | 16279 | 0.9824 q11-v2.7 | 50646 / 54571 | 46192 / 53667 | 54571 | 53667 | 0.9834 q95 | 121299 / 129225 | 121197 / 128010 | 129225 | 128010 | 0.9906 q16 | 40738 / 42717 | 40133 / 42402 | 42717 | 42402 | 0.9926 q83 | 25370 / 26521 | 24994 / 26405 | 26521 | 26405 | 0.9956 q8 | 8837 / 11696 | 8688 / 11670 | 11696 | 11670 | 0.9978 q14b | 165137 / 169002 | 164642 / 168943 | 169002 | 168943 | 0.9997 q23a | 143793 / 151616 | 143302 / 151598 | 151616 | 151598 | 0.9999 q14a-v2.7 | 192375 / 212209 | 194224 / 212386 | 212209 | 212386 | 1.0008 q23b | 135824 / 145186 | 137020 / 145420 | 145186 | 145420 | 1.0016 q14a | 210252 / 214839 | 210073 / 215748 | 214839 | 215748 | 1.0042 q10 | 17649 / 21792 | 18157 / 21951 | 21792 | 21951 | 1.0073 q58 | 29045 / 31574 | 29685 / 31910 | 31574 | 31910 | 1.0106 q74 | 41370 / 46157 | 43644 / 46670 | 46157 | 46670 | 1.0111 q5a-v2.7 | 58671 / 64598 | 59746 / 65335 | 64598 | 65335 | 1.0114 q60 | 31167 / 35465 | 29486 / 35930 | 35465 | 35930 | 1.0131 q74-v2.7 | 41394 / 44803 | 42327 / 45914 | 44803 | 45914 | 1.0248 q11 | 49738 / 54953 | 52977 / 57312 | 54953 | 57312 | 1.0429 q93 | 72402 / 74155 | 73451 / 77526 | 74155 | 77526 | 1.0455 q5 | 62540 / 65776 | 66606 / 69293 | 65776 | 69293 | 1.0535 q35-v2.7 | 17534 / 19987 | 19913 / 21244 | 19987 | 21244 | 1.0629 [q94](https://user-images.githubusercontent.com/5399861/47770003-b0cd5380-dd18-11e8-841f-8ad40ae3577a.png) | 20375 / 20631 | 23385 / 24338 | 20631 | 24338 | 1.1797 [q70](https://user-images.githubusercontent.com/5399861/47769990-a1e6a100-dd18-11e8-9448-7634dfc579d3.png) | 17182 / 19610 | 18252 / 23405 | 19610 | 23405 | 1.1935 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
