[
https://issues.apache.org/jira/browse/IMPALA-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312662#comment-17312662
]
Wenzhe Zhou commented on IMPALA-10339:
--------------------------------------
IMPALA-10259 tried to fix the race condition in generating status reports.
Apparently the issue can still happen in the following scenario.
After all fragment instances are started, the QueryState main thread monitors
the states of the fragment instances and periodically sends status reports to
the coordinator. Assume there are two fragment instances.
1) The first fragment instance completes without error.
2) The QueryState main thread sends a status report with overall_status as OK
and with one instance as "done".
3) The second fragment instance is still running without error.
4) The QueryState main thread calls ConstructReport() to construct another
status report. The function sets overall_status on the report as OK.
Meanwhile, the second fragment instance fails. It marks the instance as
failed, sets overall_status to the error that occurred in the second instance,
and sets the instance state to "FINISHED".
The QueryState main thread continues constructing the status report and adds
the second instance as "done". The current overall_status now holds the error,
but the overall status on the report is still OK.
5) The coordinator receives the status report, reduces
num_remaining_instances, and sets the backend state to "completed".
6) The QueryState main thread sends the last status report with the error in
overall_status.
7) The coordinator receives the last status report from that backend but
ignores the error in the report, since the backend has already been marked as
"completed".
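
The interleaving above can be replayed deterministically with a small Python
model. This is only an illustrative sketch, not Impala code: BackendSim,
CoordinatorSim, Report, the between_steps hook and the error string are all
hypothetical names, and the coordinator logic is reduced to the two behaviours
described in steps 5 and 7. It shows how a report can carry overall_status OK
while listing every instance as done.

```python
from dataclasses import dataclass, field

OK = "OK"


@dataclass
class Report:
    overall_status: str
    done_instances: list


@dataclass
class BackendSim:
    """Hypothetical stand-in for the executor-side query state."""
    overall_status: str = OK
    instance_done: dict = field(default_factory=dict)

    def finish_instance(self, name, error=None):
        # A failing instance records its error first, then becomes "done".
        if error is not None and self.overall_status == OK:
            self.overall_status = error
        self.instance_done[name] = True

    def construct_report(self, between_steps=None):
        # Racy ordering: the overall status is snapshotted first ...
        status = self.overall_status
        # ... another thread may run here (injected for the simulation) ...
        if between_steps is not None:
            between_steps()
        # ... and the per-instance states are collected afterwards.
        done = [n for n, d in self.instance_done.items() if d]
        return Report(status, done)


@dataclass
class CoordinatorSim:
    """Hypothetical stand-in for the coordinator's view of one backend."""
    num_instances: int
    query_status: str = OK
    completed: bool = False
    seen_done: set = field(default_factory=set)

    def apply_report(self, report):
        if self.completed:
            return False  # reports from a completed backend are ignored
        if report.overall_status != OK:
            self.query_status = report.overall_status
        self.seen_done.update(report.done_instances)
        if len(self.seen_done) == self.num_instances:
            self.completed = True
        return True


def simulate():
    backend = BackendSim(instance_done={"instance-1": False,
                                        "instance-2": False})
    coord = CoordinatorSim(num_instances=2)

    backend.finish_instance("instance-1")            # step 1
    coord.apply_report(backend.construct_report())   # step 2

    # Steps 3-4: the second instance fails while the report is being built.
    racy = backend.construct_report(
        between_steps=lambda: backend.finish_instance(
            "instance-2", error="simulated instance failure"))
    coord.apply_report(racy)                         # step 5

    final = backend.construct_report()               # step 6
    applied = coord.apply_report(final)              # step 7
    return racy, final, coord, applied
```

Calling simulate() returns the racy report, the final report, the coordinator
model and a flag telling whether the final report was applied. In this model
the error only ever appears in the final report, which arrives after the
backend has been marked completed. One way to avoid that in the model would be
to read the status and the instance states under the same lock, or to read the
status last; whether that matches the actual fix is not stated in this comment.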
> Apparent hang or crash in
> TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action
> -------------------------------------------------------------------------------------------
>
> Key: IMPALA-10339
> URL: https://issues.apache.org/jira/browse/IMPALA-10339
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Tim Armstrong
> Assignee: Wenzhe Zhou
> Priority: Blocker
> Labels: broken-build, flaky, hang
>
> Release build with this commit as the tip:
> {noformat}
> commit 9400e9b17b13f613defb6d7b9deb471256b1d95c (CDH/cdpd-master-staging)
> Author: wzhou-code <[email protected]>
> Date: Thu Oct 29 22:32:03 2020 -0700
> IMPALA-10305: Sync Kudu's FIPS compliant changes
>
> {noformat}
> {noformat}
> Regression
> query_test.test_spilling.TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action[protocol:
> beeswax | exec_option: {'mt_dop': 0, 'default_spillable_buffer_size': '64k'}
> | table_format: parquet/none] (from pytest)
> Failing for the past 1 build (Since Failed#100 )
> Took 1 hr 59 min.
> Error Message
> query_test/test_spilling.py:134: in test_spilling_no_debug_action
> self.run_test_case('QueryTest/spilling-no-debug-action', vector)
> common/impala_test_suite.py:668: in run_test_case
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions (expected_str,
> actual_str) E AssertionError: Unexpected exception string. Expected:
> row_regex:.*Cannot perform hash join at node with id .*. Repartitioning did
> not reduce the size of a spilled partition.* E Not found in actual: Timeout
> >7200s
> Stacktrace
> query_test/test_spilling.py:134: in test_spilling_no_debug_action
> self.run_test_case('QueryTest/spilling-no-debug-action', vector)
> common/impala_test_suite.py:668: in run_test_case
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions
> (expected_str, actual_str)
> E AssertionError: Unexpected exception string. Expected: row_regex:.*Cannot
> perform hash join at node with id .*. Repartitioning did not reduce the size
> of a spilled partition.*
> E Not found in actual: Timeout >7200s
> Standard Error
> SET
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> -- executing against localhost:21000
> use tpch_parquet;
> -- 2020-11-11 23:12:04,319 INFO MainThread: Started query
> c740c1c66d9679a9:6a40f16100000000
> SET
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> SET mt_dop=0;
> SET default_spillable_buffer_size=64k;
> -- 2020-11-11 23:12:04,320 INFO MainThread: Loading query test file:
> /data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
> -- 2020-11-11 23:12:04,323 INFO MainThread: Starting new HTTP connection
> (1): localhost
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:04,377 INFO MainThread: Started query
> c044afcf5ae44df9:a2e2e7c600000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a, lineitem b
> where
> a.l_partkey = 1 and
> a.l_orderkey = b.l_orderkey;
> -- 2020-11-11 23:12:04,385 INFO MainThread: Started query
> 314c019cd252f322:2411bc7600000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:05,199 INFO MainThread: Started query
> 80424e68922c30f9:b2144dff00000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:05,207 INFO MainThread: Started query
> 2a4c1f4b26ea52da:4339f3ff00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a
> where
> a.l_partkey not in (select l_partkey from lineitem where l_partkey > 10)
> and a.l_partkey < 1000;
> -- 2020-11-11 23:12:05,215 INFO MainThread: Started query
> f845afd00a569446:79c5054a00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:07,507 INFO MainThread: Started query
> ee4f8a685928e7ef:830d965100000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:07,512 INFO MainThread: Started query
> 654a6eced9594931:cc68289a00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:07,519 INFO MainThread: Started query
> 7a41a406b446e082:b5e3bf2f00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:08,549 INFO MainThread: Started query
> 1445a2833895ee1a:d136681000000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:08,554 INFO MainThread: Started query
> 4149ef276c643426:a285156400000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:08,562 INFO MainThread: Started query
> 58427e99dadb6ca9:0d184f2700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:09,586 INFO MainThread: Started query
> 1d498d1d50b86ed3:616ba0e600000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:09,592 INFO MainThread: Started query
> 4543a32357f9b4cb:e3dbfab800000000
> -- executing against localhost:21000
> with x as (select * from supplier limit 10)
> select straight_join count(*)
> from
> x right anti join lineitem on s_suppkey + 100 = l_suppkey;
> -- 2020-11-11 23:12:09,599 INFO MainThread: Started query
> 9e4ba8fd3dbbf2bd:d73094e700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:10,320 INFO MainThread: Started query
> 62471da179150bd1:ffa7dd7000000000
> -- executing against localhost:21000
> set mem_limit=75m;
> -- 2020-11-11 23:12:10,326 INFO MainThread: Started query
> 8445afe55c819986:0915d3f400000000
> -- executing against localhost:21000
> select l_orderkey, group_concat(repeat(l_comment, 10)) comments
> from lineitem
> group by l_orderkey
> order by comments desc
> limit 5;
> -- 2020-11-11 23:12:10,332 INFO MainThread: Started query
> d84ba6a837b98c8d:c7c73fd500000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,638 INFO MainThread: Started query
> 58476b9f3b524901:bed8346900000000
> -- executing against localhost:21000
> set topn_bytes_limit=-1;
> -- 2020-11-11 23:12:10,642 INFO MainThread: Started query
> 2742e503cfca610e:a4029e4800000000
> -- executing against localhost:21000
> set mem_limit=100m;
> -- 2020-11-11 23:12:10,648 INFO MainThread: Started query
> 9a4f67b8c8a95465:67f8143100000000
> -- executing against localhost:21000
> select *
> from lineitem
> order by l_orderkey desc
> limit 6000000;
> -- 2020-11-11 23:12:10,654 INFO MainThread: Started query
> 7147f6f62984fdb2:614cab6a00000000
> -- executing against localhost:21000
> SET TOPN_BYTES_LIMIT="536870912";
> -- 2020-11-11 23:12:10,859 INFO MainThread: Started query
> 4642e316a7f90110:c8344b0d00000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,863 INFO MainThread: Started query
> 414724e3a0a7f290:d1a0e69000000000
> -- executing against localhost:21000
> set mem_limit=250m;
> -- 2020-11-11 23:12:10,867 INFO MainThread: Started query
> 854c0c0d1569c7f5:3635e5ae00000000
> -- executing against localhost:21000
> select straight_join *
> from supplier join /* +broadcast */ lineitem on s_suppkey = l_linenumber
> order by l_tax desc
> limit 5;
> -- 2020-11-11 23:12:10,873 INFO MainThread: Started query
> de444e75ef44ee0b:a1614bfe00000000
> ~~~~~~~~~~~~~~~~~~~~~ Stack of <unknown> (140237730514688)
> ~~~~~~~~~~~~~~~~~~~~~
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 277, in _perform_spawn
> reply.run()
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 213, in run
> self._result = func(*args, **kwargs)
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 954, in _thread_receiver
> msg = Message.from_io(io)
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 418, in from_io
> header = io.read(9) # type 1, channel 4, payload 4
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 386, in read
> data = self._read(numbytes-len(buf))
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-12 01:12:03,717 INFO MainThread: Started query
> 8d4eac3249e55996:2b2053a700000000
> {noformat}