[
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657532#comment-16657532
]
Paul Rogers edited comment on IMPALA-7655 at 10/29/18 5:34 PM:
---------------------------------------------------------------
Work on this ticket ran into a number of known and new bugs in the expression
rewriter. All of those issues are out of scope for this fix. As a result, this
fix will simply rewrite the three functions in question to a {{CASE}}
statement. {{if()}} and {{isnull()}} are rewritten directly to a {{CASE}}
statement which other rewrite rules simplify (or not, depending on the other
bugs.)
{{coalesce()}} retains its existing simplification code, which is extended to
emit a {{CASE}} statement rather than a simplified {{coalesce()}} call. When
doing so, extended the simplification with two additional optimizations.
# Remove not only leading null values, but all null values.
# Special case not just the last non-null literal, but rather when encountering
the first such value, drop all remaining terms.
The existing BE implementation is retained because bugs and limitations mean
that there will still be paths in which the interpreted versions are called:
# If the user disables rewrites.
# If the functions occur in the {{ORDER BY}} clause.
was (Author: paul.rogers):
Work on this ticket required a broad review of conditional functions as
summarized in IMPALA-7747. The notes below focus on the functions covered in
this ticket.
h4. {{ISNULL(a, b)}}
BE: Alias for this method exist in {{impala_functions.py}}, special
implementation in {{conditional-functions.[h|cc]}}.
*Suggestion:* Rewrite as:
{code:sql}
CASE a IS NULL THEN b ELSE a END
{code}
Since {{isnull()}} would vanish from the plan after this transform, remove the
BE implementation. Ensure that the entry in {{impala_functions.py}} remains so
that the function appears in the list of built-in functions.
h4. {{NVL(a, b)}} \\ {{IFNULL(a, b)}}
FE, {{SimplifyConditional}}: Treated same as {{ISNULL(a, b)}}, but is not
rewritten to this form.
BE: Alias for this method exist in {{impala_functions.py}}.
*Suggestion:* Rewrite to {{ISNULL(a, b)}}, to make things a bit more tidy.
h4. {{IF(cond, trueExpr, falseExpr)}}
FE: {{SimplifyConditional}} performs basic simplifications.
BE: Implemented in {{conditional-functions.[h|cc]}} as an interpreted-only
function to allow short-circuit argument evaluation.
*Suggestion:* Rewrite in the FE to
{code:sql}
CASE WHEN cond THEN trueExpr ELSE falseExpr END
{code}
{{IF()}} will then vanish from the plan so remove the BE implementation,
leaving the entry in {{impala_functions.py}}.
h4. {{COALESCE(e1, e2, … en)}}
FE: {{SimplifyConditional}} performs basic simplifications.
BE: Implemented in {{conditional-functions.[h|cc]}} as an interpreted-only
function to allow short-circuit argument evaluation.
*Suggestion:* Rewrite in the FE to
{noformat}
CASE WHEN [ei IS NOT NULL THEN ei]* ELSE en END
{noformat}
When doing so, extend two existing optimizations.
1. Remove not only leading null values, but all null values.
2. Special case not just the last non-null literal, but rather when
encountering the first such value, drop all remaining terms.
{{COLAESCE()}} will then vanish from the plan so remove the BE implementation.
h4. Remove {{conditional-functions.[h|cc]}}
Since the above will remove the the three special conditional functions, remove
{{conditional-functions.[h|cc]}} as well.
> Codegen output for conditional functions (if,isnull, coalesce) is very
> suboptimal
> ---------------------------------------------------------------------------------
>
> Key: IMPALA-7655
> URL: https://issues.apache.org/jira/browse/IMPALA-7655
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Paul Rogers
> Priority: Major
> Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation
> involving an if() function was very slow, 10x slower than the equivalent
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case
> when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0 |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak
> Mem | Est. Peak Mem | Detail |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1 | 44.03ms | 44.03ms | 1 | 1 | 25.00
> KB | 10.00 MB | FINALIZE |
> | 00:SCAN HDFS | 1 | 411.57ms | 411.57ms | 59.99M | -1 | 16.61
> MB | 88.00 MB | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0 |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak
> Mem | Est. Peak Mem | Detail |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1 | 422.07ms | 422.07ms | 1 | 1 | 25.00
> KB | 10.00 MB | FINALIZE |
> | 00:SCAN HDFS | 1 | 511.13ms | 511.13ms | 59.99M | -1 | 16.61
> MB | 88.00 MB | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for
> ConditionalFunction, and just fall back to emitting a call to the interpreted
> path:
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support:
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]