[ 
https://issues.apache.org/jira/browse/IMPALA-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312662#comment-17312662
 ] 

Wenzhe Zhou commented on IMPALA-10339:
--------------------------------------

IMPALA-10259 Tried to fix the race condition when generating status report. 
Apparently the issue still could happen in following scenario.

After all fragment instances are started, QueryState main thread monitor the 
states of fragment instances and periodically send status report to 
coordinator. Assume there are two fragment instances.

1) First fragment instance is completed without error.

2) QueryState main thread send status report with overall_status as OK and with 
one instance as "done".

3) Second fragment instance is stilling running without error.

4) QueryState main thread call ConstructReport() to construct another status 
report. The function set overall_status as OK.

     Second fragment instance fails, set the instance as failed, overall_status 
as the error happened in second instance, the instance state as  "FINISHED".

    QueryState main thread continue to construct status report, add second 
instance as "done", while the current overall_status has error, but overall 
status on the report is OK. 

5) Coordinator receive the status report, reduce the num_remaining_instances, 
and set the backend state as "completed".

6) QueryState main thread send last status report with error in overall_status.

7) Coordinator receive the last status report from that backend, but ignore the 
error in the report since the backend has been marked as "completed". 

 

> Apparent hang or crash in 
> TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action
> -------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-10339
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10339
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Tim Armstrong
>            Assignee: Wenzhe Zhou
>            Priority: Blocker
>              Labels: broken-build, flaky, hang
>
> Release build with this commit as the tip:
> {noformat}
> commit 9400e9b17b13f613defb6d7b9deb471256b1d95c (CDH/cdpd-master-staging)
> Author: wzhou-code <[email protected]>
> Date:   Thu Oct 29 22:32:03 2020 -0700
>     IMPALA-10305: Sync Kudu's FIPS compliant changes
>     
> {noformat}
> {noformat}
> Regression
> query_test.test_spilling.TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action[protocol:
>  beeswax | exec_option: {'mt_dop': 0, 'default_spillable_buffer_size': '64k'} 
> | table_format: parquet/none] (from pytest)
> Failing for the past 1 build (Since Failed#100 )
> Took 1 hr 59 min.
> add description
> Error Message
> query_test/test_spilling.py:134: in test_spilling_no_debug_action     
> self.run_test_case('QueryTest/spilling-no-debug-action', vector) 
> common/impala_test_suite.py:668: in run_test_case     
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db) 
> common/impala_test_suite.py:485: in __verify_exceptions     (expected_str, 
> actual_str) E   AssertionError: Unexpected exception string. Expected: 
> row_regex:.*Cannot perform hash join at node with id .*. Repartitioning did 
> not reduce the size of a spilled partition.* E   Not found in actual: Timeout 
> >7200s
> Stacktrace
> query_test/test_spilling.py:134: in test_spilling_no_debug_action
>     self.run_test_case('QueryTest/spilling-no-debug-action', vector)
> common/impala_test_suite.py:668: in run_test_case
>     self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions
>     (expected_str, actual_str)
> E   AssertionError: Unexpected exception string. Expected: row_regex:.*Cannot 
> perform hash join at node with id .*. Repartitioning did not reduce the size 
> of a spilled partition.*
> E   Not found in actual: Timeout >7200s
> Standard Error
> SET 
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> -- executing against localhost:21000
> use tpch_parquet;
> -- 2020-11-11 23:12:04,319 INFO     MainThread: Started query 
> c740c1c66d9679a9:6a40f16100000000
> SET 
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> SET mt_dop=0;
> SET default_spillable_buffer_size=64k;
> -- 2020-11-11 23:12:04,320 INFO     MainThread: Loading query test file: 
> /data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
> -- 2020-11-11 23:12:04,323 INFO     MainThread: Starting new HTTP connection 
> (1): localhost
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:04,377 INFO     MainThread: Started query 
> c044afcf5ae44df9:a2e2e7c600000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a, lineitem b
> where
> a.l_partkey = 1 and
> a.l_orderkey = b.l_orderkey;
> -- 2020-11-11 23:12:04,385 INFO     MainThread: Started query 
> 314c019cd252f322:2411bc7600000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:05,199 INFO     MainThread: Started query 
> 80424e68922c30f9:b2144dff00000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:05,207 INFO     MainThread: Started query 
> 2a4c1f4b26ea52da:4339f3ff00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a
> where
> a.l_partkey not in (select l_partkey from lineitem where l_partkey > 10)
> and a.l_partkey < 1000;
> -- 2020-11-11 23:12:05,215 INFO     MainThread: Started query 
> f845afd00a569446:79c5054a00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:07,507 INFO     MainThread: Started query 
> ee4f8a685928e7ef:830d965100000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:07,512 INFO     MainThread: Started query 
> 654a6eced9594931:cc68289a00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:07,519 INFO     MainThread: Started query 
> 7a41a406b446e082:b5e3bf2f00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:08,549 INFO     MainThread: Started query 
> 1445a2833895ee1a:d136681000000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:08,554 INFO     MainThread: Started query 
> 4149ef276c643426:a285156400000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:08,562 INFO     MainThread: Started query 
> 58427e99dadb6ca9:0d184f2700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:09,586 INFO     MainThread: Started query 
> 1d498d1d50b86ed3:616ba0e600000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:09,592 INFO     MainThread: Started query 
> 4543a32357f9b4cb:e3dbfab800000000
> -- executing against localhost:21000
> with x as (select * from supplier limit 10)
> select straight_join count(*)
> from
> x right anti join lineitem on s_suppkey + 100 = l_suppkey;
> -- 2020-11-11 23:12:09,599 INFO     MainThread: Started query 
> 9e4ba8fd3dbbf2bd:d73094e700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:10,320 INFO     MainThread: Started query 
> 62471da179150bd1:ffa7dd7000000000
> -- executing against localhost:21000
> set mem_limit=75m;
> -- 2020-11-11 23:12:10,326 INFO     MainThread: Started query 
> 8445afe55c819986:0915d3f400000000
> -- executing against localhost:21000
> select l_orderkey, group_concat(repeat(l_comment, 10)) comments
> from lineitem
> group by l_orderkey
> order by comments desc
> limit 5;
> -- 2020-11-11 23:12:10,332 INFO     MainThread: Started query 
> d84ba6a837b98c8d:c7c73fd500000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,638 INFO     MainThread: Started query 
> 58476b9f3b524901:bed8346900000000
> -- executing against localhost:21000
> set topn_bytes_limit=-1;
> -- 2020-11-11 23:12:10,642 INFO     MainThread: Started query 
> 2742e503cfca610e:a4029e4800000000
> -- executing against localhost:21000
> set mem_limit=100m;
> -- 2020-11-11 23:12:10,648 INFO     MainThread: Started query 
> 9a4f67b8c8a95465:67f8143100000000
> -- executing against localhost:21000
> select *
> from lineitem
> order by l_orderkey desc
> limit 6000000;
> -- 2020-11-11 23:12:10,654 INFO     MainThread: Started query 
> 7147f6f62984fdb2:614cab6a00000000
> -- executing against localhost:21000
> SET TOPN_BYTES_LIMIT="536870912";
> -- 2020-11-11 23:12:10,859 INFO     MainThread: Started query 
> 4642e316a7f90110:c8344b0d00000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,863 INFO     MainThread: Started query 
> 414724e3a0a7f290:d1a0e69000000000
> -- executing against localhost:21000
> set mem_limit=250m;
> -- 2020-11-11 23:12:10,867 INFO     MainThread: Started query 
> 854c0c0d1569c7f5:3635e5ae00000000
> -- executing against localhost:21000
> select straight_join *
> from supplier join /* +broadcast */ lineitem on s_suppkey = l_linenumber
> order by l_tax desc
> limit 5;
> -- 2020-11-11 23:12:10,873 INFO     MainThread: Started query 
> de444e75ef44ee0b:a1614bfe00000000
> ~~~~~~~~~~~~~~~~~~~~~ Stack of <unknown> (140237730514688) 
> ~~~~~~~~~~~~~~~~~~~~~
>   File 
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
>  line 277, in _perform_spawn
>     reply.run()
>   File 
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
>  line 213, in run
>     self._result = func(*args, **kwargs)
>   File 
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
>  line 954, in _thread_receiver
>     msg = Message.from_io(io)
>   File 
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
>  line 418, in from_io
>     header = io.read(9)  # type 1, channel 4, payload 4
>   File 
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
>  line 386, in read
>     data = self._read(numbytes-len(buf))
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-12 01:12:03,717 INFO     MainThread: Started query 
> 8d4eac3249e55996:2b2053a700000000
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to