[
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wan Kun updated SPARK-42551:
----------------------------
Description:
h1. *Design Sketch*
h2. How to support more subexpressions elimination cases
* Get all common expressions from input expressions of the current physical
operator to current CodeGenContext. Recursively visits all subexpressions
regardless of whether the current expression is a conditional expression.
* For each common expression:
* Add a new boolean variable *subExprInit* to indicate whether it has already
been evaluated.
* Add a new code block in CodeGenSupport trait, and reset those *subExprInit*
variables to *false* before the physical operators begin to evaluate the input
row.
* Add a new wrapper subExpr function for each common subexpression.
|private void subExpr_n(${argList}) {
if (!subExprInit) {
${eval.code}
subExprInit_n = true;
subExprIsNull_n = ${eval.isNull};
subExprValue_n = ${eval.value};
}
}|
* When performing gen code of the input expression, if the input expression
is in the common expressions of the current CodeGenContext, the corresponding
subExpr function will be called. After the first function call, *subExprInit*
will be set to true, and the subsequent function calls will be skipped.
h2. Why should we support whole-stage subexpression elimination
Right now each spark physical operator shares nothing but the input row, so the
same expressions may be evaluated multiple times across different operators.
For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter
[udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter
operators. We can reuse the expression results across different operators such
as Project and Filter.
h2. How to support whole-stage subexpression elimination
* Add two properties in CodegenSupport trait, the reusable expressions and the
the output attributes, we can reuse the expression results only if the output
attributes are the same.
* Visit all operators from top to bottom, bound the candidate expressions with
the output attributes and add to the current candidate reusable expressions.
* Visit all operators from bottom to top, collect all the common expressions
to the current operator, and add the initialize code to the current operator if
the common expressions have not been initialized.
* Replace the common expressions code when generating codes for the physical
operators.
h1. *New support subexpression elimination patterns*
*
h2. *Support subexpression elimination with conditional expressions*
{code:java}
SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
{code}
We can reuse the result of expression *v + 1*
{code:java}
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression b + c
*
h2. *Support subexpression elimination in FilterExec*
{code:java}
SELECT * FROM (
SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10
{code}
We can reuse the result of expression *v* * *v* *+* *1*
*
h2. *Support subexpression elimination in JoinExec*
{code:java}
SELECT *
FROM values(1, 1) as t1(a, b)
join values(1, 2) as t2(x, y)
ON b * y between 2 and 3{code}
We can reuse the result of expression *b* * *y*
*
h2. *Support subexpression elimination in ExpandExec*
{code:java}
SELECT a, count(b),
count(distinct case when b > 1 then b + c else null end) as count_bc_1,
count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression b + c
was:
h1. *Design Sketch*
h2. How to support more subexpressions elimination cases
* Get all common expressions from input expressions of the current physical
operator to current CodeGenContext. Recursively visits all subexpressions
regardless of whether the current expression is a conditional expression.
* For each common expression:
* Add a new boolean variable *subExprInit* to indicate whether it has already
been evaluated.
* Add a new code block in CodeGenSupport trait, and reset those *subExprInit*
variables to *false* before the physical operators begin to evaluate the input
row.
* Add a new wrapper subExpr function for each common subexpression.
|private void subExpr_n(${argList}) {
if (!subExprInit) {
${eval.code}
subExprInit_n = true;
subExprIsNull_n = ${eval.isNull};
subExprValue_n = ${eval.value};
}
}|
h1.
**
* When performing gen code of the input expression, if the input expression
is in the common expressions of the current CodeGenContext, the corresponding
subExpr function will be called. After the first function call, *subExprInit*
will be set to true, and the subsequent function calls will be skipped.
h2. Why should we support whole-stage subexpression elimination
Right now each spark physical operator shares nothing but the input row, so the
same expressions may be evaluated multiple times across different operators.
For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter
[udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter
operators. We can reuse the expression results across different operators such
as Project and Filter.
h2. How to support whole-stage subexpression elimination
* Add two properties in CodegenSupport trait, the reusable expressions and the
the output attributes, we can reuse the expression results only if the output
attributes are the same.
* Visit all operators from top to bottom, bound the candidate expressions with
the output attributes and add to the current candidate reusable expressions.
* Visit all operators from bottom to top, collect all the common expressions
to the current operator, and add the initialize code to the current operator if
the common expressions have not been initialized.
* Replace the common expressions code when generating codes for the physical
operators.
h1. *New support subexpression elimination patterns*
*
h2. *Support subexpression elimination with conditional expressions*
{code:java}
SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
{code}
We can reuse the result of expression *v + 1*
{code:java}
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression b + c
*
h2. *Support subexpression elimination in FilterExec*
{code:java}
SELECT * FROM (
SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10
{code}
We can reuse the result of expression *v* * *v* *+* *1*
*
h2. *Support subexpression elimination in JoinExec*
{code:java}
SELECT *
FROM values(1, 1) as t1(a, b)
join values(1, 2) as t2(x, y)
ON b * y between 2 and 3{code}
We can reuse the result of expression *b* * *y*
*
h2. *Support subexpression elimination in ExpandExec*
{code:java}
SELECT a, count(b),
count(distinct case when b > 1 then b + c else null end) as count_bc_1,
count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression b + c
> Support more subexpression elimination cases
> --------------------------------------------
>
> Key: SPARK-42551
> URL: https://issues.apache.org/jira/browse/SPARK-42551
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: Wan Kun
> Priority: Major
>
> h1. *Design Sketch*
> h2. How to support more subexpressions elimination cases
> * Get all common expressions from input expressions of the current physical
> operator to current CodeGenContext. Recursively visits all subexpressions
> regardless of whether the current expression is a conditional expression.
> * For each common expression:
> * Add a new boolean variable *subExprInit* to indicate whether it has
> already been evaluated.
> * Add a new code block in CodeGenSupport trait, and reset those
> *subExprInit* variables to *false* before the physical operators begin to
> evaluate the input row.
> * Add a new wrapper subExpr function for each common subexpression.
> |private void subExpr_n(${argList}) {
> if (!subExprInit) {
> ${eval.code}
> subExprInit_n = true;
> subExprIsNull_n = ${eval.isNull};
> subExprValue_n = ${eval.value};
> }
> }|
>
> * When performing gen code of the input expression, if the input expression
> is in the common expressions of the current CodeGenContext, the corresponding
> subExpr function will be called. After the first function call, *subExprInit*
> will be set to true, and the subsequent function calls will be skipped.
> h2. Why should we support whole-stage subexpression elimination
> Right now each spark physical operator shares nothing but the input row, so
> the same expressions may be evaluated multiple times across different
> operators. For example, the expression udf(c1, c2) in plan Project [udf(c1,
> c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project
> and Filter operators. We can reuse the expression results across different
> operators such as Project and Filter.
> h2. How to support whole-stage subexpression elimination
> * Add two properties in CodegenSupport trait, the reusable expressions and
> the the output attributes, we can reuse the expression results only if the
> output attributes are the same.
> * Visit all operators from top to bottom, bound the candidate expressions
> with the output attributes and add to the current candidate reusable
> expressions.
> * Visit all operators from bottom to top, collect all the common expressions
> to the current operator, and add the initialize code to the current operator
> if the common expressions have not been initialized.
> * Replace the common expressions code when generating codes for the
> physical operators.
> h1. *New support subexpression elimination patterns*
> *
> h2. *Support subexpression elimination with conditional expressions*
> {code:java}
> SELECT case when v + 2 > 1 then 1
> when v + 1 > 2 then 2
> when v + 1 > 3 then 3 END vv
> FROM values(1) as t2(v)
> {code}
> We can reuse the result of expression *v + 1*
> {code:java}
> SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null))
> min_bc
> FROM values(1, 1, 1) as t(a, b, c)
> GROUP BY a
> {code}
> We can reuse the result of expression b + c
> *
> h2. *Support subexpression elimination in FilterExec*
>
> {code:java}
> SELECT * FROM (
> SELECT v * v + 1 v1 from values(1) as t2(v)
> ) t
> where v1 > 5 and v1 < 10
> {code}
> We can reuse the result of expression *v* * *v* *+* *1*
> *
> h2. *Support subexpression elimination in JoinExec*
>
> {code:java}
> SELECT *
> FROM values(1, 1) as t1(a, b)
> join values(1, 2) as t2(x, y)
> ON b * y between 2 and 3{code}
>
> We can reuse the result of expression *b* * *y*
> *
> h2. *Support subexpression elimination in ExpandExec*
> {code:java}
> SELECT a, count(b),
> count(distinct case when b > 1 then b + c else null end) as count_bc_1,
> count(distinct case when b < 0 then b + c else null end) as count_bc_2
> FROM values(1, 1, 1) as t(a, b, c)
> GROUP BY a
> {code}
> We can reuse the result of expression b + c
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]