francis0407 opened a new pull request #24321: SPARK-27411: DataSourceV2Strategy should not eliminate subquery
URL: https://github.com/apache/spark/pull/24321
 
 
   ## What changes were proposed in this pull request?
   
   In DataSourceV2Strategy, it seems we eliminate subqueries by mistake after normalizing the filters.
   Consider a SQL query with a scalar subquery:
   
   ``` scala
   val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
   plan.explain(true)
   ```
   
   DataSourceV2Strategy then logs:
   ```
   Pushing operators to csv:examples/src/main/resources/t2.txt
   Pushed Filters: 
   Post-Scan Filters: isnotnull(t2a#30)
   Output: t2a#30, t2b#31
   ```
   
   `Post-Scan Filters` should contain the scalar subquery predicate, but we eliminate it by mistake, so the physical plan filters only on `isnotnull(t2a#30)`:
   ```
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter ('t2a > scalar-subquery#56 [])
      :  +- 'Project [unresolvedalias('max('t1a), None)]
      :     +- 'UnresolvedRelation `t1`
      +- 'UnresolvedRelation `t2`
   
   == Analyzed Logical Plan ==
   t2a: string, t2b: string
   Project [t2a#30, t2b#31]
   +- Filter (t2a#30 > scalar-subquery#56 [])
      :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
      :     +- SubqueryAlias `t1`
      :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
      +- SubqueryAlias `t2`
         +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
   
   == Optimized Logical Plan ==
   Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   :     +- Project [t1a#13]
   :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
   
   == Physical Plan ==
   *(1) Project [t2a#30, t2b#31]
   +- *(1) Filter isnotnull(t2a#30)
      +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
   ```
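
   One way to keep the subquery predicate, sketched here with toy types rather than Spark's Catalyst classes: set aside the filters that contain a subquery before pushdown, and append them back to the post-scan filters afterwards. Every name below (`Expr`, `FilterSplitSketch`, `planFilters`, and so on) is hypothetical and exists only for illustration; `pushFilters` is a stand-in for a source that accepts no filters, matching the empty `Pushed Filters` line in the log above. This is an assumption about the shape of a fix, not a copy of the patch.

   ``` scala
   // Toy expression model. These are NOT Spark classes; all names are hypothetical.
   sealed trait Expr
   case class Attr(name: String) extends Expr
   case class IsNotNull(child: Expr) extends Expr
   case class GreaterThan(left: Expr, right: Expr) extends Expr
   case class ScalarSubquery(sql: String) extends Expr

   object FilterSplitSketch {
     // True if the expression tree contains a subquery anywhere.
     def hasSubquery(e: Expr): Boolean = e match {
       case _: ScalarSubquery => true
       case IsNotNull(c)      => hasSubquery(c)
       case GreaterThan(l, r) => hasSubquery(l) || hasSubquery(r)
       case _: Attr           => false
     }

     // Stand-in for filter pushdown: this pretend source accepts nothing,
     // so every filter comes back as a post-scan filter.
     def pushFilters(filters: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
       (Seq.empty, filters)

     // Returns (pushed filters, post-scan filters).
     // Subquery filters bypass pushdown and are re-attached afterwards,
     // so they cannot be lost along the way.
     def planFilters(filters: Seq[Expr]): (Seq[Expr], Seq[Expr]) = {
       val (withSubquery, withoutSubquery) = filters.partition(hasSubquery)
       val (pushed, postScanWithoutSubquery) = pushFilters(withoutSubquery)
       (pushed, postScanWithoutSubquery ++ withSubquery)
     }
   }
   ```

   With the two predicates from the optimized plan above as input (`IsNotNull(Attr("t2a"))` and `GreaterThan(Attr("t2a"), ScalarSubquery("select max(t1a) from t1"))`), `planFilters` in this sketch returns both of them as post-scan filters.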
   ## How was this patch tested?
   
   Unit tests.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 