Flyangz opened a new pull request #36047:
URL: https://github.com/apache/spark/pull/36047


   ### What changes were proposed in this pull request?
   Add `OptimizeSubqueries` below the `InjectRuntimeFilter` rule in `SparkOptimizer`.
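   The intended ordering can be sketched as follows; the batch names and trigger strategies here are illustrative assumptions, not the exact diff:
   ```scala
   // Sketch of SparkOptimizer#defaultBatches (illustrative, not the exact diff):
   // run OptimizeSubqueries right after InjectRuntimeFilter so the injected
   // bloom filter subqueries go through the default optimization rules.
   Batch("InjectRuntimeFilter", FixedPoint(1),
     InjectRuntimeFilter),
   Batch("Optimize Subqueries Injected by Runtime Filter", Once,
     OptimizeSubqueries),
   ```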
   
   ### Why are the changes needed?
   BloomFilter subqueries injected by `InjectRuntimeFilter` read as many columns as `filterCreationSidePlan` produces. This contradicts the design goal of "only scan the required columns". This can be reproduced with a simple case:
   ```scala
   withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true",
     SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000",
     SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") {
     val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62"
     sql(query).explain()
   }
   ```
   The root cause is that the injected subqueries are not optimized by the default rule batches, so rules such as `ColumnPruning` never run on them; this PR fixes that.
   
   Another option is to add a `Project` node below the `Aggregate` in `injectBloomFilter`, like:
   ```scala
   val project = Project(filterCreationSideExp.references.toSeq, filterCreationSidePlan)
   val aggregate = ConstantFolding(Aggregate(Nil, Seq(alias), project))
   ```
   That is effectively what `ColumnPruning` does when it optimizes a BloomFilter creation subquery, but I think applying `OptimizeSubqueries` is more comprehensive.
   
   ### Does this PR introduce _any_ user-facing change?
   No. The runtime bloom filter feature has not been released yet.
   
   ### How was this patch tested?
   Improved the existing test by adding a `columnPruningTakesEffect` check that inspects the optimized plan of the bloom filter join.
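   A rough sketch of what such a check might look like; the helper name comes from this PR, but the traversal details below are assumptions, not the actual test code:
   ```scala
   // Sketch (assumption, not the actual test code): assert that after pruning,
   // every bloom filter creation-side Aggregate reads only the join key column.
   private def columnPruningTakesEffect(plan: LogicalPlan): Boolean = {
     val creationSideAggs = plan.collect {
       case agg: Aggregate if agg.aggregateExpressions.exists(
           _.collectFirst { case b: BloomFilterAggregate => b }.isDefined) => agg
     }
     creationSideAggs.nonEmpty && creationSideAggs.forall(_.child.output.size == 1)
   }
   ```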
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
