[ 
https://issues.apache.org/jira/browse/IMPALA-14992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081171#comment-18081171
 ] 

ASF subversion and git services commented on IMPALA-14992:
----------------------------------------------------------

Commit f927e8e77320f09c0e3e3995d674e249d55a2196 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f927e8e77 ]

IMPALA-14992: Deflake test_cancelled_nodes_in_exec_summary

The failed query has 3 scan nodes:
  with l as (
    select * from tpch.lineitem UNION ALL select * from tpch.lineitem
  )
  select STRAIGHT_JOIN count(*) from
    (select * from tpch.lineitem a LIMIT 1) a
  join
    (select * from l LIMIT 125000) b
  on a.l_orderkey = -b.l_orderkey;

The test expects the two scan nodes under the UNION to have the CANCELLED
marker, and the other scan node on "tpch.lineitem a LIMIT 1" to complete
and therefore not have the CANCELLED marker. However, "LIMIT 1" is also
applied to the ExchangeNode on top of that scan, and the ExchangeNode is
the leaf of the coordinator fragment:

  F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
  ...
  06:EXCHANGE [UNPARTITIONED]
  |  limit: 1
  |  in pipelines: 00(GETNEXT)
  |
  F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  00:SCAN HDFS [tpch.lineitem a, RANDOM]
     HDFS partitions=1/1 files=1 size=718.94MB
     ...
     limit: 1

There are 3 scan node instances for this HdfsScanNode. As soon as any of
them returns a row, the coordinator marks the query as finished and
cancels all running fragment instances, which might cancel the other 2
scan node instances. Usually they finish quickly since they have limit=1,
so they won't be cancelled. The test is flaky when the cluster is busy and
those scan node instances run slowly.
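
The race described above can be illustrated with a toy timing model. This is
only a sketch: the function name, the instance names, the timings and the
single "cancel latency" parameter are all hypothetical and do not correspond
to Impala internals.

```python
# Toy model of the race; all names and timings are hypothetical.

def cancelled_instances(finish_times, cancel_latency):
    """Return the instances that are still running when cancellation lands.

    finish_times maps an instance name to the time it would finish on its
    own. The first instance to finish has produced the one row the
    coordinator needs, so cancellation reaches the others cancel_latency
    later.
    """
    cancel_time = min(finish_times.values()) + cancel_latency
    return sorted(name for name, t in finish_times.items() if t > cancel_time)


# Idle cluster: every instance hits limit=1 before cancellation arrives.
idle = {"inst0": 1.0, "inst1": 1.1, "inst2": 1.2}
print(cancelled_instances(idle, cancel_latency=0.5))

# Busy cluster: one instance is slow and gets cancelled, so its scan node
# would carry the CANCELLED marker in the exec summary.
busy = {"inst0": 1.0, "inst1": 4.0, "inst2": 1.2}
print(cancelled_instances(busy, cancel_latency=0.5))
```

In this model the outcome depends only on whether a slow instance outlives
the cancellation, which is why the failure shows up only on loaded machines.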

This patch deflakes the test by not using "LIMIT 1" on the scan to
ensure it can finish.
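
For reference, the kind of check the test performs on a scan node's detail
string can be sketched as below. The helper is hypothetical and is not the
code in test_observability.py; the only thing taken from the failure output
is the ", CANCELLED" suffix on the detail string.

```python
# Hypothetical helper, not the actual test code.
CANCELLED_MARKER = ", CANCELLED"


def split_cancelled(detail):
    """Split an exec-summary detail string into (base_detail, was_cancelled)."""
    if detail.endswith(CANCELLED_MARKER):
        return detail[: -len(CANCELLED_MARKER)], True
    return detail, False


# The flaky run reported the first form where the test expected the second.
print(split_cancelled("tpch.lineitem a, CANCELLED"))
print(split_cancelled("tpch.lineitem a"))
```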

Testing
 - Ran the test 1000 times.

Change-Id: I00d0b81051c864f76060ae7ebcafb635da37ab1e
Reviewed-on: http://gerrit.cloudera.org:8080/24306
Reviewed-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Surya Hebbar <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> test_cancelled_nodes_in_exec_summary is flaky
> ---------------------------------------------
>
>                 Key: IMPALA-14992
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14992
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Quanlong Huang
>            Priority: Major
>              Labels: broken-build
>
> test_cancelled_nodes_in_exec_summary is flaky. The following failure
> appears in some builds:
> {noformat}
> E   assert 'tpch.lineitem a, CANCELLED' == 'tpch.lineitem a'
> E     - tpch.lineitem a
> E     + tpch.lineitem a, CANCELLED
>         query      = '\n        with l as (select * from tpch.lineitem UNION ALL select * from tpch.lineitem)\n        select STRAIGHT_JOIN...neitem a LIMIT 1) a\n        join\n          (select * from l LIMIT 125000) b\n        on a.l_orderkey = -b.l_orderkey'
> {noformat}
> E.g.:
> * https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/5098/testReport/junit/query_test.test_observability/TestObservability/test_cancelled_nodes_in_exec_summary/
> * https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/5097/testReport/junit/query_test.test_observability/TestObservability/test_cancelled_nodes_in_exec_summary/



