[ 
https://issues.apache.org/jira/browse/CALCITE-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226429#comment-14226429
 ] 

Julian Hyde commented on CALCITE-481:
-------------------------------------

[~vladimirsitnikov] Some context. Hive is used for workloads that are 
basically ETL. The queries are large and complex and can take hours. Often the 
data does not have stats, or there are so many joins that the cardinality 
estimates for the later part of the query are junk. This is the kind of 
workload where I think Spool could be useful.

As for whether Spool is logical or physical: I agree that it mainly seems to 
be physical, but here are two data points against that:

(a) There might be multiple implementations of Spool: 1. write to a temp table, 
2. stream the results to multiple consumers and introduce a small 
reference-counting cyclic buffer so that the consumers are not in lock-step. 
This argues for Spool having a base class (i.e. in the core package).
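
To make point (a) concrete, here is a minimal single-threaded Java sketch of a 
common base class with the two strategies. All names (Spool, 
MaterializingSpool, StreamingSpool, newReader, buffered) are hypothetical and 
are not Calcite APIs. An in-memory list stands in for the temp table. The 
streaming variant keeps a per-row reference count in a plain list rather than 
a true cyclic buffer, and assumes every reader is registered before the first 
row is read.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical base class: one input, evaluated once, any number of readers. */
abstract class Spool<T> {
  protected final Iterator<T> input;

  Spool(Iterator<T> input) {
    this.input = input;
  }

  /** Registers a consumer and returns its private cursor over the shared rows. */
  abstract Iterator<T> newReader();
}

/** Strategy 1: materialize all rows (a list stands in for a temp table). */
final class MaterializingSpool<T> extends Spool<T> {
  private List<T> rows; // populated on first use

  MaterializingSpool(Iterator<T> input) {
    super(input);
  }

  @Override Iterator<T> newReader() {
    if (rows == null) {
      rows = new ArrayList<>();
      input.forEachRemaining(rows::add);
    }
    return rows.iterator();
  }
}

/** Strategy 2: stream rows, buffering only those some reader has not yet seen. */
final class StreamingSpool<T> extends Spool<T> {
  private static final class Slot<R> {
    final R row;
    int pending; // readers that have not consumed this row yet

    Slot(R row, int pending) {
      this.row = row;
      this.pending = pending;
    }
  }

  private final List<Slot<T>> buffer = new ArrayList<>();
  private long base;       // absolute index of buffer.get(0)
  private int readers;     // number of registered readers
  private boolean started; // true once any row has been read

  StreamingSpool(Iterator<T> input) {
    super(input);
  }

  @Override Iterator<T> newReader() {
    if (started) {
      throw new IllegalStateException("register all readers before reading");
    }
    readers++;
    return new Iterator<T>() {
      private long pos; // absolute index of the next row for this reader

      @Override public boolean hasNext() {
        return pos < base + buffer.size() || input.hasNext();
      }

      @Override public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        started = true;
        if (pos == base + buffer.size()) {
          // This reader is the furthest ahead: pull one more row from the input.
          buffer.add(new Slot<>(input.next(), readers));
        }
        Slot<T> slot = buffer.get((int) (pos - base));
        pos++;
        slot.pending--;
        // Discard leading rows that every reader has consumed.
        while (!buffer.isEmpty() && buffer.get(0).pending == 0) {
          buffer.remove(0);
          base++;
        }
        return slot.row;
      }
    };
  }

  /** Number of rows currently held for slower readers. */
  int buffered() {
    return buffer.size();
  }
}
```

Because both strategies hide behind the same newReader() call, a consumer does 
not need to know which one the planner picked, which is the case for a shared 
base class.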

(b) A strong hint that spooling is an approach worth considering is the 
existence of a WITH clause. For example

{code}
WITH oldEmps AS (SELECT * FROM emp WHERE age > 70)
SELECT COUNT(*) FROM oldEmps AS e1, oldEmps AS e2
{code}

We could translate a WITH query into a Spool. We would be under no obligation 
to actually spool -- and one of the first rules we fire would be 
SpoolRemoveRule: converts {Spool - X} to {X}. This use case argues for Spool 
being a logical operator.
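
As a rough illustration of what such a rule does, the Java sketch below strips 
every Spool node from a toy plan tree. PlanNode and SpoolRemove are 
hypothetical stand-ins written for this example; they are not Calcite's 
RelNode or RelOptRule APIs, and a real rule would be invoked by the planner 
rather than by a recursive walk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical plan node: an operator name plus its inputs. */
class PlanNode {
  final String op;
  final List<PlanNode> inputs;

  PlanNode(String op, PlanNode... inputs) {
    this.op = op;
    this.inputs = new ArrayList<>(Arrays.asList(inputs));
  }

  @Override public String toString() {
    if (inputs.isEmpty()) {
      return op;
    }
    StringJoiner joiner = new StringJoiner(", ", op + "(", ")");
    for (PlanNode input : inputs) {
      joiner.add(input.toString());
    }
    return joiner.toString();
  }
}

/** Hypothetical rewrite: converts Spool(X) to X everywhere in the tree. */
final class SpoolRemove {
  private SpoolRemove() {}

  static PlanNode apply(PlanNode node) {
    // Rewrite the inputs first, in place, so shared sub-trees stay shared.
    for (int i = 0; i < node.inputs.size(); i++) {
      node.inputs.set(i, apply(node.inputs.get(i)));
    }
    // A Spool has exactly one input; replace the Spool with that input.
    return node.op.equals("Spool") ? node.inputs.get(0) : node;
  }
}
```

For the WITH query above, a plan shaped like Join(Spool(X), Spool(X)) with a 
shared Spool would come out as Join(X, X), with both join inputs still 
pointing at the same X node.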

> Add "Spool" operator, to allow re-use of relational expressions
> ---------------------------------------------------------------
>
>                 Key: CALCITE-481
>                 URL: https://issues.apache.org/jira/browse/CALCITE-481
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>
> If a sub-tree occurs more than once in a query, an efficient plan would 
> probably evaluate it once and have two readers read the same data. We 
> propose a "Spool" relational expression for this purpose.
> Spool would have one input, the expression that populates it.
> In the VolcanoPlanner, any RelNode can already have multiple consumers (each 
> of which sees the same row type and the same data) but an optimal plan does 
> not typically include multiple uses of the same node, so most implementors 
> (e.g. EnumerableRelImplementor) would just not notice, and generate the same 
> code twice. Having an explicit Spool would alert the implementor to re-use 
> the result.
> We do not prescribe a mechanism for implementing Spool as a physical 
> operator. A job that populates a temporary table is one possible mechanism.
> As part of this case, we should implement Spool in Enumerable convention, and 
> use it to evaluate some test queries.
> The other reason to implement Spool is costing. The cost of a Spool with N 
> consumers is typically something like A + B * N, where A, the fixed cost, is 
> significantly larger than B, the re-play cost.
> Volcano's dynamic programming model does not make it easy to account for 
> re-use. There are approaches in academia based on integer linear programming; 
> see e.g. http://www.slideshare.net/INRIA-OAK/plreuse 
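
The cost formula in the quoted description can be turned into a small Java 
sketch that compares spooling against re-evaluating the input once per 
consumer. The per-evaluation cost c and the numbers in the usage note are 
assumptions made up for illustration; the description only gives the fixed 
cost A and the re-play cost B.

```java
/** Hypothetical cost comparison for a sub-tree with n consumers. */
final class SpoolCost {
  private SpoolCost() {}

  /** Cost of spooling: fixed cost a plus re-play cost b per consumer. */
  static double spoolCost(double a, double b, int n) {
    return a + b * n;
  }

  /** Assumed cost of not spooling: evaluate the input once per consumer. */
  static double recomputeCost(double c, int n) {
    return c * n;
  }

  /** True if spooling is strictly cheaper than re-evaluating n times. */
  static boolean spoolWins(double a, double b, double c, int n) {
    return spoolCost(a, b, n) < recomputeCost(c, n);
  }
}
```

With the made-up values a = 100, b = 1 and c = 60, a single consumer makes 
spooling the more expensive choice (101 against 60), while two consumers make 
it the cheaper one (102 against 120).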



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
