[
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650895#comment-16650895
]
Paul Rogers edited comment on IMPALA-7655 at 10/16/18 12:24 AM:
----------------------------------------------------------------
It turns out that the parser treats {{IF}} a function (which makes sense, since
that is how the SQL syntax works). Suppose we have:
{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}
The parser rule for this is:
{code}
| KW_IF LPAREN expr_list:exprs RPAREN
{: RESULT = new FunctionCallExpr("if", exprs); :}
{code}
[~tarmstrong] explained that we we have native implementations of these
functions and we generate a "wrapper" which turns around and calls the
implementation. So, it seems that simply having a native implementation of the
function is not sufficient performant; there is apparently overhead in
shuffling arguments to the function and back.
The alternative is to treat {{IF}} and similar functions as operations, not
functions. For one thing, this could result in fewer CPU cycles since we need
to evaluate only the "then" or "else" clause, but not both. (We have to
evaluate both when {{IF}} is implemented as a function.)
So, if we need to codegen the operation, we'd either need to implement codegen
rules for each conditional operation (looks to be very complex), or reuse those
we already have ([~tarmstrong]'s suggestion to reuse {{CASE}}.)
[~philip] suggest that the rewrite rules should apply to [all of
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:
{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}
A quick check of the C++ implementation shows implementations for {{IS NULL}}
and {{IS NOT NULL}} in {{is-null-predicate.cc}}. Also, in {{scalar-expr.cc}}
there is special-case code for {{IF}}, {{ISNULL}}, {{IFNULL}}, {{NVL}} and
{{COALESE}}.
The cleanest, most general is to parse all conditional ({{IF}}-like functions)
into a new "Conditional" parse node. That node would capture the four relevant
attributes: operation, condition, then-expr and else-expr. Doing so would
eliminate the awkward "if is a function but not really" logic that currently
appears in both the FE and BE. The drawback is that this looks like quite a bit
of work.
A quick & dirty alternative is to parse conditionals directly into a {{CASE}}
statement:
{code}
| KW_IF LPAREN expr_list:exprs RPAREN
{: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}
Doing so, however, would bypass the rewrite rules in
{{SimplifyConditionalsRule}} and may have other unintended side effects.
Perhaps the quickest (if not cleanest) solution might be to live with the
current implementation, but extend {{SimplifyConditionalsRule}} which already
handles some of the these functions ({{if}}, {{coalesce}}, {{ifnull}},
{{isnull}}, and {{nvl}}). Today, it just looks for obvious optimizations
({{if(true, x, y)}}, etc.) These can be extended to first apply the
optimizations, then rewrite the expression into a {{CASE}} statement (using the
{{CaseExpr}} node class).
The trick will be to:
1. Ensure we have (or can create) tests for each conditional to verify current
functionality.
2. Identify the existing special handling of each conditional which will be
removed as the rewrite rules are added.
3. One-by-one, add rewrite rules for each conditional, removing existing
special code.
4. Retest using the tests created in step 1 to verify correct behavior.
The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to
test the new transformations.
The {{tests/query_test/test_exprs.py}} file looks like it should help with step
1 above, but it is not a complete test of all the conditional functions. A
quick search for "isnotfalse" (a rather unique SQL function name) found no
matches in Python files outside of the function table mentioned below. So,
looks like detailed tests have to be created. Still, the {{test_exprs.py}} file
can serve as a template for generating such tests.
was (Author: paul.rogers):
It turns out that the parser treats {{IF}} a function (which makes sense, since
that is how the SQL syntax works). Suppose we have:
{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}
The parser rule for this is:
{code}
| KW_IF LPAREN expr_list:exprs RPAREN
{: RESULT = new FunctionCallExpr("if", exprs); :}
{code}
[~tarmstrong] explained that we we have native implementations of these
functions and we generate a "wrapper" which turns around and calls the
implementation. So, it seems that simply having a native implementation of the
function is not sufficient performant; there is apparently overhead in
shuffling arguments to the function and back.
The alternative is to treat {{IF}} and similar functions as operations, not
functions. For one thing, this could result in fewer CPU cycles since we need
to evaluate only the "then" or "else" clause, but not both. (We have to
evaluate both when {{IF}} is implemented as a function.)
So, if we need to codegen the operation, we'd either need to implement codegen
rules for each conditional operation (looks to be very complex), or reuse those
we already have ([~tarmstrong]'s suggestion to reuse {{CASE}}.)
[~philip] suggest that the rewrite rules should apply to [all of
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:
{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}
A quick check of the C++ implementation shows implementations for {{IS NULL}}
and {{IS NOT NULL}} in {{is-null-predicate.cc}}. Also, in {{scalar-expr.cc}}
there is special-case code for {{IF}}, {{ISNULL}}, {{IFNULL}}, {{NVL}} and
{{COALESE}}.
The cleanest, most general is to parse all conditional ({{IF}}-like functions)
into a new "Conditional" parse node. That node would capture the four relevant
attributes: operation, condition, then-expr and else-expr. Doing so would
eliminate the awkward "if is a function but not really" logic that currently
appears in both the FE and BE. The drawback is that this looks like quite a bit
of work.
A quick & dirty alternative is to parse conditionals directly into a {{CASE}}
statement:
{code}
| KW_IF LPAREN expr_list:exprs RPAREN
{: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}
Doing so, however, would bypass the rewrite rules in
{{SimplifyConditionalsRule}} and may have other unintended side effects.
Perhaps the quickest (if not cleanest) solution might be to live with the
current implementation, but extend {{SimplifyConditionalsRule}} which already
handles some of the these functions ({{if}}, {{coalesce}}, {{ifnull}},
{{isnull}}, and {{nvl}}). Today, it just looks for obvious optimizations
({{if(true, x, y)}}, etc.) These can be extended to first apply the
optimizations, then rewrite the expression into a {{CASE}} statement (using the
{{CaseExpr}} node class).
The trick will be to:
1. Ensure we have (or can create) tests for each conditional to verify current
functionality.
2. Identify the existing special handling of each conditional which will be
removed as the rewrite rules are added.
3. One-by-one, add rewrite rules for each conditional, removing existing
special code.
4. Retest using the tests created in step 1 to verify correct behavior.
The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to
test the new transformations.
> Codegen output for conditional functions (if,isnull, coalesce) is very
> suboptimal
> ---------------------------------------------------------------------------------
>
> Key: IMPALA-7655
> URL: https://issues.apache.org/jira/browse/IMPALA-7655
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Priority: Major
> Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation
> involving an if() function was very slow, 10x slower than the equivalent
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case
> when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0 |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak
> Mem | Est. Peak Mem | Detail |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1 | 44.03ms | 44.03ms | 1 | 1 | 25.00
> KB | 10.00 MB | FINALIZE |
> | 00:SCAN HDFS | 1 | 411.57ms | 411.57ms | 59.99M | -1 | 16.61
> MB | 88.00 MB | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0 |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak
> Mem | Est. Peak Mem | Detail |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1 | 422.07ms | 422.07ms | 1 | 1 | 25.00
> KB | 10.00 MB | FINALIZE |
> | 00:SCAN HDFS | 1 | 511.13ms | 511.13ms | 59.99M | -1 | 16.61
> MB | 88.00 MB | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for
> ConditionalFunction, and just fall back to emitting a call to the interpreted
> path:
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support:
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]