[jira] [Comment Edited] (IMPALA-7655) Codegen output for conditional functions (if,isnull, coalesce) is very suboptimal

Paul Rogers (JIRA) Mon, 15 Oct 2018 17:25:25 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650895#comment-16650895
 ]


Paul Rogers edited comment on IMPALA-7655 at 10/16/18 12:24 AM:
----------------------------------------------------------------

It turns out that the parser treats {{IF}} a function (which makes sense, since 
that is how the SQL syntax works). Suppose we have:

{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}

The parser rule for this is:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new FunctionCallExpr("if", exprs); :}
{code}

[~tarmstrong] explained that we we have native implementations of these 
functions and we generate a "wrapper" which turns around and calls the 
implementation. So, it seems that simply having a native implementation of the 
function is not sufficient performant; there is apparently overhead in 
shuffling arguments to the function and back.

The alternative is to treat {{IF}} and similar functions as operations, not 
functions. For one thing, this could result in fewer CPU cycles since we need 
to evaluate only the "then" or "else" clause, but not both. (We have to 
evaluate both when {{IF}} is implemented as a function.)

So, if we need to codegen the operation, we'd either need to implement codegen 
rules for each conditional operation (looks to be very complex), or reuse those 
we already have ([~tarmstrong]'s suggestion to reuse {{CASE}}.)

[~philip] suggest that the rewrite rules should apply to [all of 
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:

{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}

A quick check of the C++ implementation shows implementations for {{IS NULL}} 
and {{IS NOT NULL}} in {{is-null-predicate.cc}}. Also, in {{scalar-expr.cc}} 
there is special-case code for {{IF}}, {{ISNULL}}, {{IFNULL}}, {{NVL}} and 
{{COALESE}}.

The cleanest, most general is to parse all conditional ({{IF}}-like functions) 
into a new "Conditional" parse node. That node would capture the four relevant 
attributes: operation, condition, then-expr and else-expr. Doing so would 
eliminate the awkward "if is a function but not really" logic that currently 
appears in both the FE and BE. The drawback is that this looks like quite a bit 
of work.

A quick & dirty alternative is to parse conditionals directly into a {{CASE}} 
statement:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}

Doing so, however, would bypass the rewrite rules in 
{{SimplifyConditionalsRule}} and may have other unintended side effects.

Perhaps the quickest (if not cleanest) solution might be to live with the 
current implementation, but extend {{SimplifyConditionalsRule}} which already 
handles some of the these functions ({{if}}, {{coalesce}}, {{ifnull}}, 
{{isnull}}, and {{nvl}}). Today, it just looks for obvious optimizations 
({{if(true, x, y)}}, etc.) These can be extended to first apply the 
optimizations, then rewrite the expression into a {{CASE}} statement (using the 
{{CaseExpr}} node class).

The trick will be to:

1. Ensure we have (or can create) tests for each conditional to verify current 
functionality.
2. Identify the existing special handling of each conditional which will be 
removed as the rewrite rules are added.
3. One-by-one, add rewrite rules for each conditional, removing existing 
special code.
4. Retest using the tests created in step 1 to verify correct behavior.

The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to 
test the new transformations.

The {{tests/query_test/test_exprs.py}} file looks like it should help with step 
1 above, but it is not a complete test of all the conditional functions. A 
quick search for "isnotfalse" (a rather unique SQL function name) found no 
matches in Python files outside of the function table mentioned below. So, 
looks like detailed tests have to be created. Still, the {{test_exprs.py}} file 
can serve as a template for generating such tests.


was (Author: paul.rogers):
It turns out that the parser treats {{IF}} a function (which makes sense, since 
that is how the SQL syntax works). Suppose we have:

{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}

The parser rule for this is:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new FunctionCallExpr("if", exprs); :}
{code}

[~tarmstrong] explained that we we have native implementations of these 
functions and we generate a "wrapper" which turns around and calls the 
implementation. So, it seems that simply having a native implementation of the 
function is not sufficient performant; there is apparently overhead in 
shuffling arguments to the function and back.

The alternative is to treat {{IF}} and similar functions as operations, not 
functions. For one thing, this could result in fewer CPU cycles since we need 
to evaluate only the "then" or "else" clause, but not both. (We have to 
evaluate both when {{IF}} is implemented as a function.)

So, if we need to codegen the operation, we'd either need to implement codegen 
rules for each conditional operation (looks to be very complex), or reuse those 
we already have ([~tarmstrong]'s suggestion to reuse {{CASE}}.)

[~philip] suggest that the rewrite rules should apply to [all of 
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:

{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}

A quick check of the C++ implementation shows implementations for {{IS NULL}} 
and {{IS NOT NULL}} in {{is-null-predicate.cc}}. Also, in {{scalar-expr.cc}} 
there is special-case code for {{IF}}, {{ISNULL}}, {{IFNULL}}, {{NVL}} and 
{{COALESE}}.

The cleanest, most general is to parse all conditional ({{IF}}-like functions) 
into a new "Conditional" parse node. That node would capture the four relevant 
attributes: operation, condition, then-expr and else-expr. Doing so would 
eliminate the awkward "if is a function but not really" logic that currently 
appears in both the FE and BE. The drawback is that this looks like quite a bit 
of work.

A quick & dirty alternative is to parse conditionals directly into a {{CASE}} 
statement:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}

Doing so, however, would bypass the rewrite rules in 
{{SimplifyConditionalsRule}} and may have other unintended side effects.

Perhaps the quickest (if not cleanest) solution might be to live with the 
current implementation, but extend {{SimplifyConditionalsRule}} which already 
handles some of the these functions ({{if}}, {{coalesce}}, {{ifnull}}, 
{{isnull}}, and {{nvl}}). Today, it just looks for obvious optimizations 
({{if(true, x, y)}}, etc.) These can be extended to first apply the 
optimizations, then rewrite the expression into a {{CASE}} statement (using the 
{{CaseExpr}} node class).

The trick will be to:

1. Ensure we have (or can create) tests for each conditional to verify current 
functionality.
2. Identify the existing special handling of each conditional which will be 
removed as the rewrite rules are added.
3. One-by-one, add rewrite rules for each conditional, removing existing 
special code.
4. Retest using the tests created in step 1 to verify correct behavior.

The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to 
test the new transformations.

> Codegen output for conditional functions (if,isnull, coalesce) is very 
> suboptimal
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7655
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7655
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation 
> involving an if() function was very slow, 10x slower than the equivalent 
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case 
> when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select 
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for 
> ConditionalFunction, and just fall back to emitting a call to the interpreted 
> path: 
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support: 
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (IMPALA-7655) Codegen output for conditional functions (if,isnull, coalesce) is very suboptimal

Reply via email to