GitHub user jianqiao opened a pull request: https://github.com/apache/incubator-quickstep/pull/237
QUICKSTEP-89 Add support for common subexpression. This PR adds support for common subexpression elimination in Quickstep. In summary we apply two kinds of optimizations: **(1)** Query rewriting for _aggregate expressions_, e.g. ``` SELECT SUM(x), SUM(x)*2, AVG(x), COUNT(x) FROM s; ``` will be rewritten as ``` SELECT sum_x, sum_x * 2, sum_x / count_x, count_x FROM ( SELECT SUM(x) AS sum_x, COUNT(x) AS count_x ) t; ``` **(2)** Common subexpression labeling with **per-storage-block memoization table** to cache result column vectors to avoid redundant computation. For example, the evaluation of expressions in ``` SELECT (x + 1) * (y + 2) * 3, (x + 1) * (y + 2), x + 1 FROM s; ``` will essentially look like: ``` t1 = x + 1 t2 = t1 * (y + 2) Out1 = t2 * 3 Out2 = Reference(t2) Out3 = Reference(t3) ``` Here `t1`, `t2`, `Out1`, `Out2`, `Out3` are per-storage-block `ColumnVector`'s. This PR together with a (later) new aggregation hash table implementation will improve the performance of TPC-H Q01 from ~11.5s to ~5.3s, with scale factor 100 on a cloud lab machine. ___ **Note**: For optimization (1), note that it is not free-of-cost to "re-project" the columns, so it may not worth doing the transformation in situations where a group-by aggregation has large number of groups. E.g., it may actually slow down the query to rewrite ``` SELECT SUM(a), SUM(b), SUM(c), SUM(d), SUM(a) FROM s GROUP BY x; ``` as ``` SELECT sum_a, sum_b, sum_c, sum_d, sum_a FROM ( SELECT SUM(a) AS sum_a, SUM(b) AS sum_b, SUM(c) AS sum_c, SUM(d) AS sum_d FROM s GROUP BY x; ) t; ``` if the number of distinct values of attribute x is large -- in that case, we avoid one duplicate computation of `SUM(a)`, but introduce 5 extra expensive column copying due to the intermediate relation `t`. Thus currently we employ a simple heuristic that if either _(1) The estimated number of groups for the Aggregate node is smaller than the specified `FLAGS_reuse_aggregate_group_size_threshold`._ or _(2) The ratio of (# of eliminable columns) to (# of original columns) is greater than the specified `FLAGS_reuse_aggregate_ratio_threshold`._ then we perform the transformation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/incubator-quickstep common-subexpression Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-quickstep/pull/237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #237 ---- commit b31f0360a93727ed1250e8e3cf28c242c3d727a1 Author: Jianqiao Zhu <jianq...@cs.wisc.edu> Date: 2017-04-20T20:28:12Z Add common-subexpression support. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---