[GitHub] incubator-quickstep pull request #237: QUICKSTEP-89 Add support for common s...

jianqiao Sat, 22 Apr 2017 18:34:08 -0700

GitHub user jianqiao opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/237


    QUICKSTEP-89 Add support for common subexpression.

    This PR adds support for common subexpression elimination in Quickstep. In 
summary we apply two kinds of optimizations:
    **(1)** Query rewriting for _aggregate expressions_, e.g.
    ```
    SELECT SUM(x), SUM(x)*2, AVG(x), COUNT(x) FROM s;
    ```
    will be rewritten as
    ```
    SELECT sum_x, sum_x * 2, sum_x / count_x, count_x
    FROM (
      SELECT SUM(x) AS sum_x, COUNT(x) AS count_x
    ) t;
    ```
    **(2)** Common subexpression labeling with **per-storage-block memoization 
table** to cache result column vectors to avoid redundant computation. For 
example, the evaluation of expressions in
    ```
    SELECT (x + 1) * (y + 2) * 3, (x + 1) * (y + 2), x + 1
    FROM s;
    ```
    will essentially look like:
    ```
    t1 = x + 1
    t2 = t1 * (y + 2)
    Out1 = t2 * 3
    Out2 = Reference(t2)
    Out3 = Reference(t3)
    ```
    Here `t1`, `t2`, `Out1`, `Out2`, `Out3` are per-storage-block 
`ColumnVector`'s.
    
    This PR together with a (later) new aggregation hash table implementation 
will improve the performance of TPC-H Q01 from ~11.5s to ~5.3s, with scale 
factor 100 on a cloud lab machine.
    
    ___
    **Note**: For optimization (1), note that it is not free-of-cost to 
"re-project" the columns, so it may not worth doing the transformation in 
situations where a group-by aggregation has large number of groups. E.g., it 
may actually slow down the query to rewrite
    ```
    SELECT SUM(a), SUM(b), SUM(c), SUM(d), SUM(a)
    FROM s
    GROUP BY x;
    ```
    as
    ```
    SELECT sum_a, sum_b, sum_c, sum_d, sum_a
    FROM (
      SELECT SUM(a) AS sum_a, SUM(b) AS sum_b, SUM(c) AS sum_c, SUM(d) AS sum_d
      FROM s
      GROUP BY x;
    ) t;
    ```
    if the number of distinct values of attribute x is large -- in that case, 
we avoid one duplicate computation of `SUM(a)`, but introduce 5 extra expensive 
column copying due to the intermediate relation `t`.
    
    Thus currently we employ a simple heuristic that if either
    _(1) The estimated number of groups for the Aggregate node is smaller than 
the specified `FLAGS_reuse_aggregate_group_size_threshold`._
    or
    _(2) The ratio of (# of eliminable columns) to (# of original columns) is 
greater than the specified `FLAGS_reuse_aggregate_ratio_threshold`._
    then we perform the transformation.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep 
common-subexpression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #237
    
----
commit b31f0360a93727ed1250e8e3cf28c242c3d727a1
Author: Jianqiao Zhu <jianq...@cs.wisc.edu>
Date:   2017-04-20T20:28:12Z

    Add common-subexpression support.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep pull request #237: QUICKSTEP-89 Add support for common s...

Reply via email to