[ 
https://issues.apache.org/jira/browse/CALCITE-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231729#comment-14231729
 ] 

Julian Hyde commented on CALCITE-481:
-------------------------------------

[~jcamachorodriguez], I can see what you are worried about. If we have a query 
{{Filter - Aggregate - Filter - X}} and we decide to materialize {{Filter - X}} 
as {{Spool - Filter - X}} then rules are going to fire that cause us to fire 
that cause us to materialize {{Spool - Aggregate - Filter - X}} and {{Spool - 
Filter - Aggregate - Filter - X}}. So, we are going to end up with a lot of 
spools!

This planner state would be valid -- any path we chose through the equivalence 
sets would form a plan that would return the correct results -- but we have 
increased the size of the search space. Clearly we don't want to use all of 
those spools. Intuitively, if we use one spool, then the spool directly 
consuming from top of it should become less attractive.

To your specific case: yes, X and Spool-X should be in the same equivalence 
set. Because they produce the same results. Volcano planners create scary-sized 
search spaces. That is why they are powerful. The solution is to manage the 
search space.

The usual techniques:

1. Don't write too many rules that create spools, and only enable them at 
prescribed times, not during the "free for all" volcano phase. 

2. During a "volcano" planning phase, use the cost model and "importance" to 
prioritize which rule matches get fired first.

Plus, specific to spooling:

3. Come up with a global costing algorithm that, given a plan space that 
contains a number of "suggested spools", can choose which spools to enable that 
minimizes the total cost.
 

> Add "Spool" operator, to allow re-use of relational expressions
> ---------------------------------------------------------------
>
>                 Key: CALCITE-481
>                 URL: https://issues.apache.org/jira/browse/CALCITE-481
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Jesus Camacho Rodriguez
>
> If a sub-tree occurs more than once in a query an efficient plan would 
> probably evaluate once and have two readers read the same data. We propose a 
> "Spool" relational expression for this purpose.
> Spool would have one input, the expression that populates it.
> In the VolcanoPlanner, any RelNode can already have multiple consumers (each 
> of which sees the same row type and the same data) but an optimal plan does 
> not typically include multiple uses of the same node, so most implementors 
> (e.g. EnumerableRelImplementor) would just not notice, and generate the same 
> code twice. Having an explicit Spool would alert the implementor to re-use 
> the result.
> We do not prescribe a mechanism for implementing Spool as a physical 
> operator. A job that populates a temporary table is one possible mechanism.
> As part of this case, we should implement Spool in Enumerable convention, and 
> use it to evaluate some test queries.
> The other reason to implement Spool is costing. The cost of a Spool with N 
> consumers is typically something like A + B . N. A, the fixed cost, is 
> significantly larger than B, the re-play cost.
> Volcano's dynamic programming model does not make it easy to account for 
> re-use. There are approaches in academia based on integer linear programming; 
> see e.g. http://www.slideshare.net/INRIA-OAK/plreuse 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to