[ 
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657532#comment-16657532
 ] 

Paul Rogers edited comment on IMPALA-7655 at 10/29/18 5:34 PM:
---------------------------------------------------------------

Work on this ticket ran into a number of known and new bugs in the expression 
rewriter. All of those issues are out of scope for this fix. As a result, this 
fix will simply rewrite the three functions in question to a {{CASE}} 
statement. {{if()}} and {{isnull()}} are rewritten directly to a {{CASE}} 
statement which other rewrite rules simplify (or not, depending on the other 
bugs.)

{{coalesce()}} retains its existing simplification code, which is extended to 
emit a {{CASE}} statement rather than a simplified {{coalesce()}} call. When 
doing so, extended the simplification with two additional optimizations.

# Remove not only leading null values, but all null values.
# Special case not just the last non-null literal, but rather when encountering 
the first such value, drop all remaining terms.

The existing BE implementation is retained because bugs and limitations mean 
that there will still be paths in which the interpreted versions are called:

# If the user disables rewrites.
# If the functions occur in the {{ORDER BY}} clause.


was (Author: paul.rogers):
Work on this ticket required a broad review of conditional functions as 
summarized in IMPALA-7747. The notes below focus on the functions covered in 
this ticket.

h4. {{ISNULL(a, b)}}

BE: Alias for this method exist in {{impala_functions.py}}, special 
implementation in {{conditional-functions.[h|cc]}}.

*Suggestion:* Rewrite as:

{code:sql}
CASE a IS NULL THEN b ELSE a END
{code}

Since {{isnull()}} would vanish from the plan after this transform, remove the 
BE implementation. Ensure that the entry in {{impala_functions.py}} remains so 
that the function appears in the list of  built-in functions.

h4. {{NVL(a, b)}} \\ {{IFNULL(a, b)}}

FE, {{SimplifyConditional}}: Treated same as {{ISNULL(a, b)}}, but is not 
rewritten to this form.

BE: Alias for this method exist in {{impala_functions.py}}.

*Suggestion:* Rewrite to {{ISNULL(a, b)}}, to make things a bit more tidy.

h4. {{IF(cond, trueExpr, falseExpr)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in  {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{code:sql}
CASE WHEN cond THEN trueExpr ELSE falseExpr END
{code}

{{IF()}} will then vanish from the plan so remove the BE implementation, 
leaving the entry in {{impala_functions.py}}.

h4. {{COALESCE(e1, e2, … en)}}

FE: {{SimplifyConditional}} performs basic simplifications.

BE: Implemented in {{conditional-functions.[h|cc]}} as an interpreted-only 
function to allow short-circuit argument evaluation.

*Suggestion:* Rewrite in the FE to

{noformat}
CASE WHEN [ei IS NOT NULL THEN ei]* ELSE en END
{noformat}

When doing so, extend two existing optimizations.

1. Remove not only leading null values, but all null values.
2. Special case not just the last non-null literal, but rather when 
encountering the first such value, drop all remaining terms.

{{COLAESCE()}} will then vanish from the plan so remove the BE implementation.

h4. Remove {{conditional-functions.[h|cc]}}

Since the above will remove the the three special conditional functions, remove 
{{conditional-functions.[h|cc]}} as well.

> Codegen output for conditional functions (if,isnull, coalesce) is very 
> suboptimal
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7655
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7655
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Paul Rogers
>            Priority: Major
>              Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation 
> involving an if() function was very slow, 10x slower than the equivalent 
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case 
> when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select 
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for 
> ConditionalFunction, and just fall back to emitting a call to the interpreted 
> path: 
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support: 
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to