asolimando commented on PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4162286204

   > ## Rationale for this change
   > Previously, DataFusion evaluated uncorrelated scalar subqueries by 
transforming them into joins. This has two shortcomings:
   > 
   > 1. Scalar subqueries that return > 1 row were allowed, producing incorrect 
query results. Such queries should instead result in a runtime error.
   > 2. Performance. Evaluating scalar subqueries as a join requires going 
through the join machinery. More importantly, it means that UDFs that have 
special-cases for scalar inputs cannot use those code paths for scalar 
subqueries, which often results in significantly slower query execution.
   > 
   > This PR introduces physical execution of uncorrelated scalar subqueries:
   > 
   > * Uncorrelated subqueries are left in the plan by the optimizer, not 
rewritten into joins
   
   I am not aware of any database going down this route, for multiple reasons:
   - You potentially give up many transformations that would make the 
subquery's plan faster 
(https://github.com/apache/datafusion/pull/21240#issuecomment-4158270781 is one 
example, but it's probably just the tip of the iceberg).
   - Alternatively, all your planning rules now have to deal with subqueries, 
which makes them more complicated, and for some of them it is already 
challenging to prove correctness: 
https://github.com/apache/datafusion/issues/21174#issue-4143242322 comes to 
mind as a tricky correctness issue, and reasoning about it would be far harder 
on a plan where subqueries are preserved.
   
   Point 1 is a bug in how subquery removal is implemented, not a limitation 
of subquery removal algorithms, so it shouldn't be used as an argument for or 
against the approach.
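
   To make that concrete, here is a minimal Python sketch (not DataFusion 
code; every name in it is hypothetical) of a join-style rewrite that still 
enforces the single-row rule. It assumes the usual SQL semantics: zero rows 
yield NULL, one row yields its value, and more than one row is a runtime error.

```python
from typing import Any, Callable, Iterable, List, Optional

_MISSING = object()  # sentinel so a legitimate None value is not mistaken for "no row"


class ScalarSubqueryError(RuntimeError):
    """Raised when a scalar subquery yields more than one row."""


def single_row_guard(rows: Iterable[Any]) -> Optional[Any]:
    """Collapse subquery output to one value, pulling at most two rows."""
    it = iter(rows)
    first = next(it, _MISSING)
    if first is _MISSING:
        return None  # zero rows: the scalar subquery evaluates to NULL
    if next(it, _MISSING) is not _MISSING:
        raise ScalarSubqueryError("scalar subquery returned more than one row")
    return first


def filter_with_scalar_subquery(
    outer_rows: Iterable[Any],
    subquery_rows: Iterable[Any],
    predicate: Callable[[Any, Any], bool],
) -> List[Any]:
    """Emulate `SELECT * FROM outer WHERE predicate(outer, (subquery))`.

    The guarded subquery side behaves like a one-row input that is
    cross-joined with the outer side, so the check lives in the rewrite.
    """
    value = single_row_guard(subquery_rows)
    # A comparison against NULL is unknown, so no outer row passes the filter.
    return [row for row in outer_rows if value is not None and predicate(row, value)]
```

   The guard sits on the subquery side only, so the outer side of the plan 
stays an ordinary join input that the existing planning rules can rewrite.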
   
   Point 2 seems like a limitation worth addressing in the general join path, 
so that most plans benefit, rather than a blocker specific to subquery 
removal, though I must admit that I am not aware of the details of the 
limitations you mention.
   
   That said, my opinion is biased towards the "query planning" side of things 
and might not do justice to the execution perspective you bring up in point 2, 
but I hope my point of view helps the discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

