[ 
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657532#comment-16657532
 ] 

Paul Rogers edited comment on IMPALA-7655 at 10/22/18 8:06 PM:
---------------------------------------------------------------

This note summarizes the state of each of the Impala conditional functions to 
identify the path needed to implement the requested changes, and to ensure that 
the changes don't impact other functionality. We also point out a few 
out-of-scope nice-to-haves as we go along.

In general, all the action here is in just a few places:

* {{sql-parser.cup}} in which syntax is reduced to parse nodes such as 
functions or operators. The parser unifies certain constructs such as {{<=>}} 
and {{IS NOT DISTINCT FROM}}.
* {{FunctionCallExpr.createExpr()}} is given a function-like definition and 
converts some of them to other forms ({{decode()}}, {{nvl2(}}, {{nullif()}}. A 
nice-to-have would be to move this logic to 
{{SimplifyConditionalsRule.apply()}} so we have a uniform way of doing 
transforms.
* {{SimplifyConditionalsRule}} does a great many transforms of various 
conditional rules. (We will add more for this task.)
* {{impala_functions.py}} in the BE provides a mapping from remaining functions 
(those not optimized away above) to implementations. All functions listed here 
are cross-compiled into LLVM along with a generated wrapper function that binds 
the function to its set of arguments.
* {{conditional-functions.[h|cc]}} handles special case functions that require 
short-circuit argument evaluation ({{isull()}}, {{if()}}, {{coalesce()}}). 
These three functions are never code generated. The goal of this task is to 
convert these into a code generated for using {{CASE}}.

For all expressions, the planner does a check for all-constant expressions 
(such as {{NULL IS NOT NULL}} or {{(10 = 9) IS TRUE}}) and replaces them with 
the result of the expression by using the BE to interpret the partial 
constant-only expression tree. As a result, the rewrite steps focus on the 
non-trivial cases that require knowledge of the semantics of a given function.

In the suggestions that follow, we rewrite certain functions into {{CASE}}. 
But, in so doing, we end up evaluating certain terms twice. IMPALA-7737 asks to 
resolve that issue.

Below is a summary of each conditional function that identifies current state 
and any changes that might be possible.

h4. {{CASE ...}}

BE: Interpreted when in the {{SELECT}} clause (IMPALA-4356). Code generated 
when in the {{WHERE}} clause or in a join.

h4. {{x IS [NOT] (TRUE | FALSE)}}

FE, {{sql-parser.cup}}: captured as a {{FunctionCallExpr}} for the equivalent 
{{ISTRUE\(x)}}, etc. function.

h4. {{x IS [NOT] NULL}}

FE, {{sql-parser.cup}}: captured as a {{IsNullPredicate}}. (Note that this is 
the opposite of {{IS TRUE}}, etc.)

BE: Cross compiled as a UDF: {{IsNullPredicate::Is[Not]Null}}, with wrapper.

h4. {{IS[NOT](TRUE|FALSE)\(x)}}
 
BE: Implemented in {{ConditionalFunctions::IsTrue}}, etc.

h4. {{NULLIF(expr1, expr2)}} \\ {{NVL2(expr, trueExpr, falseExpr)}}

FE, {{FunctionCallExpr}}: {{nullif(expr1, expr2)}} &rarr; {{if(expr1 IS 
DISTINCT FROM expr2, expr1, NULL)}}

{{NULLIF()}} and {{NVL2()}} vanish from the plan after this step. There is no 
entry for {{nullify()}} in {{impala_functions.py}}.

h4. {{ISNULL(a, b)}}

BE: Alias for this method exist in {{impala_functions.py}}, special 
implementation in {{conditional-functions.[h|cc]}}.

*Suggestion:* Rewrite as:

{code:sql}
CASE a IS NULL THEN b ELSE a END
{code}

Since {{isnull()}} would vanish from the plan after this transform, remove the 
BE implementation.

h4. {{NVL(a, b)}} \\  {{IFNULL(a, b)}}

FE, {{SimplifyConditional}}: Treated same as {{ISNULL(a, b)}}, but is not 
rewritten to this form.

BE: Alias for this method exist in {{impala_functions.py}}.

*Suggestion:* Rewrite to {{ISNULL(a, b)}}, drop from {{impala_functions.py}} to 
make things a bit more tidy. (If the suggestion for {{isnull()}} is taken, then 
even {{isnull()}} vanishes from the plan in the planner.

h4. {{[NON]NULLVALUE\(x)}}

An entry in {{impala_functions.py}} maps this method to the compiled {{IS [NOT] 
NULL}} operator implementations.

*Suggestion:* To make {{impala_functions.py}} less messy, add a transform to 
the FE to replace these functions with the operators, and remove the functions' 
entries from {{impala_functions.py}}. This also ensures that all optimization 
applied to the operators is also done for the functions.

h4. {{x <=> (TRUE | FALSE | NULL)}} \\ {{x IS [NOT] DISTINCT FROM (TRUE | FALSE 
| NULL)}}

FE {{sql-parser.cup}}: Parsed (in generic form) into a 
{{BinaryPredicate(BinaryPredicate.Operator.(NOT_DISTINCT|DISTINCT_FROM)...)}}

BE: Implemented code generated {{Operators::NotDistinct_BooleanVal_BooleanVal}}.

*Suggestion:* To leverage special Boolean optimizations, rewrite the above to 
{{IS(TRUE|FALSE)\(x)}} or {{x IS [NOT] NULL}} in the planner. (The planner 
appears to already rewrite expressions such as {{TRUE <=> x}} into a canonical 
form so that the rewrite rules need not handle both versions.)

Note: there is no function equivalent of these functions, they are "invisible" 
to the user, but are listed as {{distinctfrom}} and {{notdistinct}} in 
{{impala_functions.py}}.

h4. {{IF(cond, trueExpr, falseExpr)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in  {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{code:sql}
CASE WHEN cond THEN trueExpr ELSE falseExpr END
{code}

{{IF()}} will then vanish from the plan so remove the BE implementation.

h4. {{COALESCE(e1, e2, … en)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{noformat}
CASE WHEN [ei IS NOT NULL THEN ei]* ELSE en END
{noformat}

When doing so, extend two existing optimizations.

1. Remove not only leading null values, but all null values.
2. Special case not just the last non-null literal, but rather when 
encountering the first such value, drop all remaining terms.

{{COLAESCE()}} will then vanish from the plan so remove the BE implementation. 
Since this step will remove the last of the special conditional functions, 
remove {{conditional-functions.[h|cc]}} as well.

h4. {{DECODE(expr, search1, result1 [, search2, result2 ...] [, default] )}}

FE: {{FunctionCallExpr}}, {{CaseExpr}}: Rewrites {{decode()}} to {{CASE}}. 
{{decode()}} vanishes from the plan after this step.

See the header of {{CaseExpr.java}} for details. Looks like the implementation 
was done before {{IS DISTINCT}} was available:

{quote}
Example of equivalent {{CASE}} for {{DECODE(foo, 'bar', 1, col, 2, NULL, 3, 
4)}}:
{code:sql}
CASE 
    WHEN foo = 'bar' THEN 1   -- no need for IS NULL check
    WHEN foo IS NULL AND col IS NULL OR foo = col THEN 2
    WHEN foo IS NULL THEN 3  -- no need for equality check
    ELSE 4
END
{code}
{quote}

*Nice-to-have:* In FE, modify to use {{<=>}} (AKA {{IS NOT DISTINCT}}):

{code:sql}
CASE [WHEN expr <=> searchi THEN resulti]+ [ELSE default]? END
{code}

Example:

{code:sql}
CASE
    WHEN foo <=> 'bar' THEN 1
    WHEN foo <=> col THEN 2
    WHEN foo <=> NULL THEN 3
    ELSE 4
END
{code}

This expansion (and the original one) evaluates the decode expression multiple 
times and would benefit from the optimization mentioned earlier.

Note also that after IMPALA-6661, {{decode()}} can be used to pick out 
floating-point NaN values:

{code:sql}
decode(float_col, sqrt(-1), 0, float_col)
{code}

Here, {{sqrt(-1)}} is used to create a NaN value because Impala has no {{NaN}} 
constant or function.

*Suggestion:* The current implementation is rather ad-hoc, and may have been 
done early in the Impala lifecycle before the rewrite rules came along. Would 
be cleaner to do the rewrite as rewrite rule rather than as an ad-hoc step when 
creating an expression. That is, rather than doing the rewrite in 
{{FunctionCallExpr.createExpr()}}, do it in 
{{SimplifyConditionalsRule.apply()}}.


was (Author: paul.rogers):
This note summarizes the state of each of the Impala conditional functions to 
identify the path needed to implement the requested changes, and to ensure that 
the changes don't impact other functionality. We also point out a few 
out-of-scope nice-to-haves as we go along.

In general, all the action here is in just a few places:

* {{sql-parser.cup}} in which syntax is reduced to parse nodes such as 
functions or operators. The parser unifies certain constructs such as {{<=>}} 
and {{IS NOT DISTINCT FROM}}.
* {{FunctionCallExpr.createExpr()}} is given a function-like definition and 
converts some of them to other forms ({{decode()}}, {{nvl2(}}, {{nullif()}}. A 
nice-to-have would be to move this logic to 
{{SimplifyConditionalsRule.apply()}} so we have a uniform way of doing 
transforms.
* {{SimplifyConditionalsRule}} does a great many transforms of various 
conditional rules. (We will add more for this task.)
* {{impala_functions.py}} in the BE provides a mapping from remaining functions 
(those not optimized away above) to implementations. All functions listed here 
are cross-compiled into LLVM along with a generated wrapper function that binds 
the function to its set of arguments.
* {{conditional-functions.[h|cc]}} handles special case functions that require 
short-circuit argument evaluation ({{isull()}}, {{if()}}, {{coalesce()}}). 
These three functions are never code generated. The goal of this task is to 
convert these into a code generated for using {{CASE}}.

For all expressions, the planner does a check for all-constant expressions 
(such as {{NULL IS NOT NULL}} or {{(10 = 9) IS TRUE}}) and replaces them with 
the result of the expression by using the BE to interpret the partial 
constant-only expression tree. As a result, the rewrite steps focus on the 
non-trivial cases that require knowledge of the semantics of a given function.

In the suggestions that follow, we rewrite certain functions into {{CASE}}. 
But, in so doing, we end up evaluating certain terms twice. IMPALA-7737 asks to 
resolve that issue.

Below is a summary of each conditional function that identifies current state 
and any changes that might be possible.

h4. {{CASE ...}}

BE: Interpreted when in the {{SELECT}} clause (IMPALA-4356). Code generated 
when in the {{WHERE}} clause or in a join.

h4. {{x IS [NOT] (TRUE | FALSE)}}

FE, {{sql-parser.cup}}: captured as a {{FunctionCallExpr}} for the equivalent 
{{ISTRUE\(x)}}, etc. function.

h4. {{x IS [NOT] NULL}}

FE, {{sql-parser.cup}}: captured as a {{IsNullPredicate}}. (Note that this is 
the opposite of {{IS TRUE}}, etc.)

BE: Cross compiled as a UDF: {{IsNullPredicate::Is[Not]Null}}, with wrapper.

h4. {{IS[NOT](TRUE|FALSE)\(x)}}
 
BE: Implemented in {{ConditionalFunctions::IsTrue}}, etc.

h4. {{NULLIF(expr1, expr2)}} \\ {{NVL2(expr, trueExpr, falseExpr)}}

FE, {{FunctionCallExpr}}: {{nullif(expr1, expr2)}} &rarr; {{if(expr1 IS 
DISTINCT FROM expr2, expr1, NULL)}}

{{NULLIF()}} and {{NVL2()}} vanish from the plan after this step. There is no 
entry for {{nullify()}} in {{impala_functions.py}}.

h4. {{ISNULL(a, b)}}

BE: Alias for this method exist in {{impala_functions.py}}, special 
implementation in {{conditional-functions.[h|cc]}}.

*Suggestion:* Rewrite as:

{code:sql}
CASE a IS NULL THEN b ELSE a END
{code}

Since {{isnull()}} would vanish from the plan after this transform, remove the 
BE implementation.

h4. {{NVL(a, b)}} \\  {{IFNULL(a, b)}}

FE, {{SimplifyConditional}}: Treated same as {{ISNULL(a, b)}}, but is not 
rewritten to this form.

BE: Alias for this method exist in {{impala_functions.py}}.

*Suggestion:* Rewrite to {{ISNULL(a, b)}}, drop from {{impala_functions.py}} to 
make things a bit more tidy. (If the suggestion for {{isnull()}} is taken, then 
even {{isnull()}} vanishes from the plan in the planner.

h4. {{[NON]NULLVALUE\(x)}}

An entry in {{impala_functions.py}} maps this method to the compiled {{IS [NOT] 
NULL}} operator implementations.

*Suggestion:* To make {{impala_functions.py}} less messy, add a transform to 
the FE to replace these functions with the operators, and remove the functions' 
entries from {{impala_functions.py}}. This also ensures that all optimization 
applied to the operators is also done for the functions.

h4. {{x <=> (TRUE | FALSE | NULL)}} \\ {{x IS [NOT] DISTINCT FROM (TRUE | FALSE 
| NULL)}}

FE {{sql-parser.cup}}: Parsed (in generic form) into a 
{{BinaryPredicate(BinaryPredicate.Operator.(NOT_DISTINCT|DISTINCT_FROM)...)}}

BE: Implemented code generated {{Operators::NotDistinct_BooleanVal_BooleanVal}}.

*Suggestion:* To leverage special Boolean optimizations, rewrite the above to 
{{IS(TRUE|FALSE)\(x)}} or {{x IS [NOT] NULL}} in the planner. (The planner 
appears to already rewrite expressions such as {{TRUE <=> x}} into a canonical 
form so that the rewrite rules need not handle both versions.)

Note: there is no function equivalent of these functions, they are "invisible" 
to the user, but are listed as {{distinctfrom}} and {{notdistinct}} in 
{{impala_functions.py}}.

h4. {{IF(cond, trueExpr, falseExpr)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in  {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{code:sql}
CASE WHEN cond THEN trueExpr ELSE falseExpr END
{code}

{{IF()}} will then vanish from the plan so remove the BE implementation.

h4. {{COALESCE(e1, e2, … en)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{noformat}
CASE WHEN [ei IS NOT NULL THEN ei]* ELSE en END
{noformat}

When doing so, extend two existing optimizations.

1. Remove not only leading null values, but all null values.
2. Special case not just the last non-null literal, but rather when 
encountering the first such value, drop all remaining terms.

{{COLAESCE()}} will then vanish from the plan so remove the BE implementation. 
Since this step will remove the last of the special conditional functions, 
remove {{conditional-functions.[h|cc]}} as well.

> Codegen output for conditional functions (if,isnull, coalesce) is very 
> suboptimal
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7655
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7655
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation 
> involving an if() function was very slow, 10x slower than the equivalent 
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case 
> when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select 
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for 
> ConditionalFunction, and just fall back to emitting a call to the interpreted 
> path: 
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support: 
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to