[
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656008#comment-16656008
]
Paul Rogers commented on IMPALA-7655:
-------------------------------------
Turns out this is a complex topic: the FE has a number of transforms that
affect specific conditional statements, and the BE has additional transforms.
The lesson seems to be that many point optimizations have been made over time.
To proceed, each operation must be traced through the system one by one.
The first thing to trace: despite [~tarmstrong]'s description and the code
that exists in {{case-expr.cc}}, a quick sample query suggests that Impala does
not, in fact, perform code generation for {{CASE}} statements (at least not
in the simple test below). So, the first question to resolve is why this
is not happening. Then, we can look at leveraging {{CASE}} for {{if()}} and
other improvements.
To reproduce, start a 1-node cluster with codegen logging:
{code}
start-impala-cluster.py -s 1 --impalad_args -dump_ir
{code}
In the Impala shell, create a simple table:
{code:sql}
create database test;
use test;
create table bools (b boolean);
insert into bools values (true), (false), (null);
{code}
Then, run the following query:
{code:sql}
select case when b is null then 1 else 2 end FROM bools;
{code}
According to {{case-expr.cc}}, this should generate an LLVM function called
{{CaseExpr}} (with the usual type decorations). Look in the log: although
{{impala::IsNullPredicate::IsNull}} is generated, no {{CaseExpr}} function is
generated.
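To make the log check repeatable, here is a minimal sketch of a script that lists the functions defined in a dumped IR and filters them by name. It relies only on the fact that an LLVM IR function definition starts with {{define ... @name(}}. The helper names and the sample IR text are made up for illustration, and the log path is whatever your cluster writes to (passed on the command line), so treat this as a starting point rather than a definitive tool.

```python
import re
import sys

# An LLVM IR function definition looks like:
#   define i64 @SomeName(i8* %arg) {
# Names containing special characters are quoted: @"impala::Foo::Bar".
# search() is used (not match()) so a log prefix before "define" is tolerated.
DEFINE_RE = re.compile(r'\bdefine\s[^@\n]*@"?([^"(\s]+)"?\s*\(')


def defined_functions(ir_text):
    """Return the names of all functions defined in a blob of LLVM IR text."""
    names = []
    for line in ir_text.splitlines():
        m = DEFINE_RE.search(line)
        if m:
            names.append(m.group(1))
    return names


def functions_mentioning(ir_text, needle):
    """Return the defined function names that contain the given substring."""
    return [name for name in defined_functions(ir_text) if needle in name]


# Hypothetical IR fragment, shaped like a dump but NOT copied from a real log.
SAMPLE_IR = '''
define i16 @"impala::IsNullPredicate::IsNullWrapper"(i8* %eval, i8* %row) {
entry:
  %v = call i64 @GetSlotRef(i8* %eval, i8* %row)
  ret i16 0
}
declare i64 @GetSlotRef(i8*, i8*)
'''

if __name__ == "__main__":
    # With no argument, scan the sample; otherwise scan the given log file.
    text = open(sys.argv[1]).read() if len(sys.argv) > 1 else SAMPLE_IR
    for needle in ("IsNull", "CaseExpr"):
        print(needle, "->", functions_mentioning(text, needle))
```

Run it against the impalad log after each query; an empty list for {{CaseExpr}} matches the observation above.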
Now, run the following query:
{code:sql}
select if(b is null, 1, 2) FROM bools;
{code}
Again, review the codegen logs. The generated code is very similar. Until I get
more info, the suspicion is that neither {{CASE}} nor {{if()}} triggers codegen.
> Codegen output for conditional functions (if,isnull, coalesce) is very
> suboptimal
> ---------------------------------------------------------------------------------
>
> Key: IMPALA-7655
> URL: https://issues.apache.org/jira/browse/IMPALA-7655
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Tim Armstrong
> Priority: Major
> Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation
> involving an if() function was very slow, 10x slower than the equivalent
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case
> when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for
> ConditionalFunction, and just fall back to emitting a call to the interpreted
> path:
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support:
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178