[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

viirya Wed, 31 Aug 2016 19:39:58 -0700

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14452
  
    @hvanhovell Let me try to explain this with an example.
    
        WITH cte AS (SELECT * FROM src) SELECT * FROM cte a JOIN cte b
    
    In the query above, the common subquery `cte` would be executed twice 
without this patch. We find such common subqueries and wrap the executed plan 
of each one in a `CommonSubquery` node. Common subqueries that produce the same 
results share the same executed plan and the same variable holding the computed 
results.
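
    To make the detection step concrete, here is a rough, language-neutral 
sketch in Python. It is not the code in the patch: the `Scan`, `SubqueryAlias`, 
`Join` classes and the `dedup` function are hypothetical stand-ins, and it 
assumes that two subqueries are "common" when their plans compare equal.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical, heavily simplified plan nodes (not Spark's classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class SubqueryAlias:
    alias: str
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


class CommonSubquery:
    """Hypothetical wrapper; one instance is shared by all duplicates."""

    def __init__(self, plan):
        self.plan = plan


def subqueries(plan):
    """Yield the plan under every alias (nested subqueries are ignored)."""
    if isinstance(plan, SubqueryAlias):
        yield plan.child
    elif isinstance(plan, Join):
        yield from subqueries(plan.left)
        yield from subqueries(plan.right)


def dedup(plan):
    """Replace each subquery plan that occurs more than once with one shared wrapper."""
    counts = Counter(subqueries(plan))
    shared = {p: CommonSubquery(p) for p, n in counts.items() if n > 1}

    def rewrite(node):
        if isinstance(node, SubqueryAlias):
            return SubqueryAlias(node.alias, shared.get(node.child, node.child))
        if isinstance(node, Join):
            return Join(rewrite(node.left), rewrite(node.right))
        return node

    return rewrite(plan)


# Models: WITH cte AS (SELECT * FROM src) SELECT * FROM cte a JOIN cte b
plan = Join(SubqueryAlias("a", Scan("src")), SubqueryAlias("b", Scan("src")))
rewritten = dedup(plan)
```

    After the rewrite, both aliases point at the very same `CommonSubquery` 
object, which is what lets them share one executed plan later on.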
    
    In planning, we create a `CommonSubqueryExec` for each `CommonSubquery`. 
When `CommonSubqueryExec.doExecute` is called to materialize the results, we 
delegate to the executed plan wrapped in `CommonSubquery` and keep its results. 
As all common subqueries share the same executed plan and the same variable of 
computed results, later calls to `CommonSubqueryExec.doExecute` can directly 
reuse the computed results.
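
    The execution side can be modelled in the same way. The following is only 
a sketch under stated assumptions, not the patch itself: `ExecutedPlan`, 
`SharedSubquery` and `do_execute` are hypothetical names, the "results" are a 
plain list rather than an RDD, and the lock is my assumption about how 
concurrent callers might be kept from computing the results twice.

```python
import threading


class ExecutedPlan:
    """Hypothetical stand-in for an executed physical plan."""

    def __init__(self, rows):
        self._rows = rows
        self.runs = 0  # counts how many times the plan was really executed

    def execute(self):
        self.runs += 1
        return list(self._rows)


class SharedSubquery:
    """Holds the executed plan and the variable of computed results."""

    def __init__(self, executed_plan):
        self.executed_plan = executed_plan
        self.result = None
        self.lock = threading.Lock()  # assumption: guards the first computation


class CommonSubqueryExec:
    """Hypothetical physical node; several instances share one SharedSubquery."""

    def __init__(self, shared):
        self.shared = shared

    def do_execute(self):
        with self.shared.lock:
            if self.shared.result is None:
                # First caller delegates to the wrapped plan and keeps the results.
                self.shared.result = self.shared.executed_plan.execute()
            # Later callers take the computed results directly.
            return self.shared.result


shared = SharedSubquery(ExecutedPlan([(1, "a"), (2, "b")]))
left, right = CommonSubqueryExec(shared), CommonSubqueryExec(shared)
rows_left = left.do_execute()
rows_right = right.do_execute()
```

    Both sides of the join get the same rows, while the wrapped plan runs only 
once.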
    
    We benchmarked this patch on TPC-DS queries and saw significant improvement 
on many queries that use CTE subqueries. We are working on some filter 
pushdown issues to improve it further.
    
    Please let me know if this is clear to you.




