[
https://issues.apache.org/jira/browse/IMPALA-14992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081086#comment-18081086
]
Quanlong Huang commented on IMPALA-14992:
-----------------------------------------
The query is
{code:sql}
with l as (select * from tpch.lineitem UNION ALL select * from tpch.lineitem)
select STRAIGHT_JOIN count(*) from
(select * from tpch.lineitem a LIMIT 1) a
join
(select * from l LIMIT 125000) b
on a.l_orderkey = -b.l_orderkey;{code}
The test expects every scan node instance on "tpch.lineitem a LIMIT 1" to
finish (each returning one row). However, the failed runs show that some scan
node instances were cancelled.
Checking the query plan, I realized the coordinator fragment has a LIMIT in the
leaf ExchangeNode:
{noformat}
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
...
06:EXCHANGE [UNPARTITIONED]
| limit: 1
| in pipelines: 00(GETNEXT)
|
F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
00:SCAN HDFS [tpch.lineitem a, RANDOM]
HDFS partitions=1/1 files=1 size=718.94MB
...
limit: 1{noformat}
Because 06:EXCHANGE also has limit_=1, ExchangeNode::GetNext() sets eos to true
as soon as the first row is returned. This updates the ExecState to
RETURNED_RESULTS, and the coordinator then cancels the fragment instances. If
the CancelQueryFInstances RPC finishes before the remaining scan node instances
finish, those instances really are cancelled. This is what happens in the
failed test, so showing CANCELLED for the scan node is correct.
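The race can be illustrated with a small toy model. This is only a sketch and
not Impala code: the names ScanInstance, exec_summary_states, finish_time and
cancel_rpc_delay are all hypothetical, and time is just a number. It assumes
each scan instance returns exactly one row, that the exchange reaches eos once
its limit is met, and that instances still running when the cancel RPC lands
are reported as CANCELLED.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScanInstance:
    name: str
    finish_time: float  # hypothetical time at which the instance returns its one row


def exec_summary_states(
    instances: List[ScanInstance],
    exchange_limit: Optional[int],
    cancel_rpc_delay: float,
) -> Dict[str, str]:
    """Toy model: map each scan instance to 'OK' or 'CANCELLED'."""
    if not instances:
        return {}
    if exchange_limit is None:
        # No limit on the exchange: it never hits eos early, so in this
        # model every scan instance runs to completion.
        return {inst.name: "OK" for inst in instances}

    # With a limit, the exchange hits eos once `exchange_limit` rows arrived.
    arrivals = sorted(inst.finish_time for inst in instances)
    eos_time = arrivals[min(exchange_limit, len(arrivals)) - 1]

    # The coordinator then cancels; the RPC lands `cancel_rpc_delay` later.
    cancel_time = eos_time + cancel_rpc_delay
    return {
        inst.name: "OK" if inst.finish_time <= cancel_time else "CANCELLED"
        for inst in instances
    }


if __name__ == "__main__":
    scans = [
        ScanInstance("scan-0", 1.0),
        ScanInstance("scan-1", 2.0),
        ScanInstance("scan-2", 5.0),
    ]
    # Fast cancel RPC with limit 1 on the exchange: slower scans get cancelled.
    print(exec_summary_states(scans, exchange_limit=1, cancel_rpc_delay=0.5))
    # Same plan, slow cancel RPC: all scans finish first (the passing runs).
    print(exec_summary_states(scans, exchange_limit=1, cancel_rpc_delay=10.0))
    # No limit on the exchange: the outcome no longer depends on RPC timing.
    print(exec_summary_states(scans, exchange_limit=None, cancel_rpc_delay=0.5))
```

In this model the result with a limit on the exchange depends on how quickly
the cancel RPC arrives, which matches the flakiness seen in the test; without
the limit the result is the same for any timing.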
To deflake the test, we can adjust the query so that 06:EXCHANGE doesn't have
a limit. Then 00:SCAN HDFS can always complete.
Uploaded a fix for review: https://gerrit.cloudera.org/c/24306/
> test_cancelled_nodes_in_exec_summary is flaky
> ---------------------------------------------
>
> Key: IMPALA-14992
> URL: https://issues.apache.org/jira/browse/IMPALA-14992
> Project: IMPALA
> Issue Type: Bug
> Reporter: Zoltán Borók-Nagy
> Assignee: Quanlong Huang
> Priority: Major
> Labels: broken-build
>
> test_cancelled_nodes_in_exec_summary is flaky, we can see the following in
> some builds:
> {noformat}
> E assert 'tpch.lineitem a, CANCELLED' == 'tpch.lineitem a'
> E - tpch.lineitem a
> E + tpch.lineitem a, CANCELLED
> query = '\n with l as (select * from tpch.lineitem UNION
> ALL select * from tpch.lineitem)\n select STRAIGHT_JOIN...neitem a
> LIMIT 1) a\n join\n (select * from l LIMIT 125000) b\n
> on a.l_orderkey = -b.l_orderkey'
> {noformat}
> E.g.:
> *
> https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/5098/testReport/junit/query_test.test_observability/TestObservability/test_cancelled_nodes_in_exec_summary/
> *
> https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/5097/testReport/junit/query_test.test_observability/TestObservability/test_cancelled_nodes_in_exec_summary/