[ 
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650895#comment-16650895
 ] 

Paul Rogers edited comment on IMPALA-7655 at 10/16/18 12:11 AM:
----------------------------------------------------------------

It turns out that the parser treats {{IF}} a function (which makes sense, since 
that is how the SQL syntax works). Suppose we have:

{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}

The parser rule for this is:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new FunctionCallExpr("if", exprs); :}
{code}

[~tarmstrong] explained that we we have native implementations of these 
functions and we generate a "wrapper" which turns around and calls the 
implementation. So, it seems that simply having a native implementation of the 
function is not sufficient performant; there is apparently overhead in 
shuffling arguments to the function and back.

The alternative is to treat {{IF}} and similar functions as operations, not 
functions. For one thing, this could result in fewer CPU cycles since we need 
to evaluate only the "then" or "else" clause, but not both. (We have to 
evaluate both when {{IF}} is implemented as a function.)

So, if we need to codegen the operation, we'd either need to implement codegen 
rules for each conditional operation (looks to be very complex), or reuse those 
we already have ([~tarmstrong]'s suggestion to reuse {{CASE}}.)

[~philip] suggest that the rewrite rules should apply to [all of 
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:

{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}

A quick check of the C++ implementation shows implementations for {{IS NULL}} 
and {{IS NOT NULL}} in {{is-null-predicate.cc}}. Also, in {{scalar-expr.cc}} 
there is special-case code for {{IF}}, {{ISNULL}}, {{IFNULL}}, {{NVL}} and 
{{COALESE}}.

The cleanest, most general is to parse all conditional ({{IF}}-like functions) 
into a new "Conditional" parse node. That node would capture the four relevant 
attributes: operation, condition, then-expr and else-expr. Doing so would 
eliminate the awkward "if is a function but not really" logic that currently 
appears in both the FE and BE. The drawback is that this looks like quite a bit 
of work.

A quick & dirty alternative is to parse conditionals directly into a {{CASE}} 
statement:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}

Doing so, however, would bypass the rewrite rules in 
{{SimplifyConditionalsRule}} and may have other unintended side effects.

Perhaps the quickest (if not cleanest) solution might be to live with the 
current implementation, but extend {{SimplifyConditionalsRule}} which already 
handles some of the these functions ({{if}}, {{coalesce}}, {{ifnull}}, 
{{isnull}}, and {{nvl}}). Today, it just looks for obvious optimizations 
({{if(true, x, y)}}, etc.) These can be extended to first apply the 
optimizations, then rewrite the expression into a {{CASE}} statement (using the 
{{CaseExpr}} node class).

The trick will be to:

1. Ensure we have (or can create) tests for each conditional to verify current 
functionality.
2. Identify the existing special handling of each conditional which will be 
removed as the rewrite rules are added.
3. One-by-one, add rewrite rules for each conditional, removing existing 
special code.
4. Retest using the tests created in step 1 to verify correct behavior.

The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to 
test the new transformations.


was (Author: paul.rogers):
Considering the frontend rewrite, it turns out that IF is treated as a 
function. Suppose we have:

{code:sql}
SELECT IF(x=10, "yes", "no")
FROM ...
{code}

The parser rule for this is:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new FunctionCallExpr("if", exprs); :}
{code}

This raises a question: Does Impala have native implementations for each of its 
functions? A quick check of the C++ implementation shows implementations for 
{{IS NULL}} and {{IS NOT NULL}} in {{s-null-predicate.cc}}. Also, in 
{{scalar-expr.cc}} there is special-case code for {{IF}}, {{ISNULL}}, 
{{IFNULL}}, {{NVL}} and {{COALESE}}.

If not, then a general solution would be to create such implementations. 

[~tarmstrong]'s specific suggestion still holds: just rewrite this one 
expression. Do we have native implementations for other conditional functions? 
If not, then as [~philip] suggests, the rewrite rules should apply to [all of 
them|https://impala.apache.org/docs/build3x/html/topics/impala_conditional_functions.html]:

{noformat}
if(boolean condition, type ifTrue, type ifFalseOrNull)
ifnull(type a, type ifNull)
isfalse(boolean)
isnotfalse(boolean)
isnottrue(boolean)
isnull(type a, type ifNull)
istrue(boolean)
nonnullvalue(expression)
nullif(expr1,expr2)
nullifzero(numeric_expr)
nullvalue(expression)
nvl(type a, type ifNull)
nvl2(type a, type ifNull, type ifNotNull)
zeroifnull(numeric_expr)
{noformat}

To rewrite, we'd have to analyze functions looking for ways of rewriting a 
given function into something else (in this case, into a {{CASE}} expression.)

One possible way to handle this is to parse all conditional ({{IF}}-like 
functions) directly into a {{CASE}} statement:

{code}
  | KW_IF LPAREN expr_list:exprs RPAREN
  {: RESULT = new CaseExpr(null, <get if/then exprs>, <get else expr>); :}
{code}

Doing so, however, would bypass the rewrite rules in 
{{SimplifyConditionalsRule}}.

Another possibility is to create a new parse note to represent conditional 
functions (which would avoid the need for the awkward special-casing in the 
planner and engine), but that looks like quite a bit of work.

The quickest (if not cleanest) solution appears to be to live with the current 
implementation, but extend {{SimplifyConditionalsRule}} which already handles 
some of the these functions ({{if}}, {{coalesce}}, {{ifnull}}, {{isnull}}, and 
{{nvl}}). Today, it just looks for obvious optimizations ({{if(true, x, y)}}, 
etc.) These can be extended to first apply the optimizations, then rewrite the 
expression into a {{CASE}} statement ({{CaseExpr}}).

The unit tests in {{ExprRewriteRulesTest}} would be extended and/or modified to 
test the new transformations.
 

> Codegen output for conditional functions (if,isnull, coalesce) is very 
> suboptimal
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7655
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7655
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation 
> involving an if() function was very slow, 10x slower than the equivalent 
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case 
> when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select 
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for 
> ConditionalFunction, and just fall back to emitting a call to the interpreted 
> path: 
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support: 
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to