[
https://issues.apache.org/jira/browse/IMPALA-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311000#comment-17311000
]
Wenzhe Zhou commented on IMPALA-10339:
--------------------------------------
Checked log file of one executor. Looks like that the executor got error, but
it did not report the error to coordinator somehow so that the query was not
cancelled. This seems the root cause.
I0327 12:57:36.543496 110552 control-service.cc:148]
30408ca68c884cb5:cd01ce9900000000] ExecQueryFInstances():
query_id=30408ca68c884cb5:cd01ce9900000000 coord=ip-172-31-45-110:27000
#instances=2
I0327 12:57:36.544034 89969 query-state.cc:948]
30408ca68c884cb5:cd01ce9900000003] Executing instance.
instance_id=30408ca68c884cb5:cd01ce9900000003 fragment_idx=2
per_fragment_instance_idx=2 coord_state_idx=1 #in-flight=1
I0327 12:57:36.544111 89970 query-state.cc:948]
30408ca68c884cb5:cd01ce9900000004] Executing instance.
instance_id=30408ca68c884cb5:cd01ce9900000004 fragment_idx=1
per_fragment_instance_idx=0 coord_state_idx=1 #in-flight=2
I0327 12:57:37.767164 110579 krpc-data-stream-mgr.cc:427] Reduced stream ID
cache from 354 items, to 323, eviction took: 0
I0327 12:57:39.663609 89969 query-state.cc:957]
30408ca68c884cb5:cd01ce9900000003] Instance completed.
instance_id=30408ca68c884cb5:cd01ce9900000003 #in-flight=1 status=OK
I0327 12:57:39.682893 89978 krpc-data-stream-mgr.cc:308]
30408ca68c884cb5:cd01ce9900000004] DeregisterRecvr():
fragment_instance_id=30408ca68c884cb5:cd01ce9900000004, node=4
...
I0327 12:57:41.529778 89970 *status.cc*:88] 30408ca68c884cb5:cd01ce9900000004]
*Cannot perform hash join at node with id 2*. Repartitioning did not reduce the
size of a spilled partition. Repartitioning level 2. Number of rows 1500000:
PhjBuilder op=INNER_JOIN is_separate_build=0 num_probe_threads=1
state=RepartitioningProbe Hash partitions: 16:
Hash partition 0 <Partition>: ptr=0x7f72d830 id=32 Closed
Hash partition 1 <Partition>: ptr=0x1204af5c0 id=33 Closed
Hash partition 2 <Partition>: ptr=0x1b937d1a0 id=34 Closed
Hash partition 3 <Partition>: ptr=0x1831ab740 id=35 Closed
Hash partition 4 <Partition>: ptr=0x1831abd70 id=36 Closed
Hash partition 5 <Partition>: ptr=0x1b937c360 id=37 Closed
Hash partition 6 <Partition>: ptr=0x7f72da10 id=38 Closed
Hash partition 7 <Partition>: ptr=0x45f098c0 id=39 Closed
Hash partition 8 <Partition>: ptr=0x45f09f20 id=40 Closed
Hash partition 9 <Partition>: ptr=0x7f72d950 id=41 Closed
Hash partition 10 <Partition>: ptr=0x10cc6a840 id=42 Closed
Hash partition 11 <Partition>: ptr=0x10cc6bc20 id=43 Closed
Hash partition 12 <Partition>: ptr=0x10cc6a390 id=44 Closed
Hash partition 13 <Partition>: ptr=0x10cc6ab40 id=45 Spilled
Build Rows: 1500000 (Bytes pinned: 0)
Spilled Probe Rows: 0
Hash partition 14 <Partition>: ptr=0x10cc6b8f0 id=46 Closed
Hash partition 15 <Partition>: ptr=0x6058cab0 id=47 Closed
Spilled partitions: 3:
Spilled partition 0 <Partition>: ptr=0x14c8985a0 id=0 Spilled
Build Rows: 1928409 (Bytes pinned: 0)
Spilled Probe Rows: 2
Spilled partition 1 <Partition>: ptr=0x14c898090 id=7 Spilled
Build Rows: 1929115 (Bytes pinned: 0)
Spilled Probe Rows: 2
Spilled partition 2 <Partition>: ptr=0x61b86090 id=29 Closed
buffer_pool_client=<BufferPool::Client> 0xfe3cb58 internal state:
{<BufferPool::Client> 0x8cef180 name: HASH_JOIN_NODE id=2 ptr=0xfe3ca00
write_status: buffers allocated 0 num_pages: 19068 pinned_bytes: 0
dirty_unpinned_bytes: 0 in_flight_write_bytes: 0 reservation:
{<ReservationTracker>: reservation_limit 9223372036854775807 reservation
206569472 used_reservation 0 child_reservations 1048576 parent:
<ReservationTracker>: reservation_limit 9223372036854775807 reservation
207618048 used_reservation 0 child_reservations 207618048 parent:
<ReservationTracker>: reservation_limit 209715200 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
<ReservationTracker>: reservation_limit 10952163328 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
NULL}
0 out of 0 pinned pages:
0 out of 0 dirty unpinned pages:
0 out of 0 in flight write pages: }
<BufferPool::Client> 0xfe3cb58 internal state: {<BufferPool::Client> 0x8cef180
name: HASH_JOIN_NODE id=2 ptr=0xfe3ca00 write_status: buffers allocated 0
num_pages: 19068 pinned_bytes: 0 dirty_unpinned_bytes: 0 in_flight_write_bytes:
0 reservation: {<ReservationTracker>: reservation_limit 9223372036854775807
reservation 206569472 used_reservation 0 child_reservations 1048576 parent:
<ReservationTracker>: reservation_limit 9223372036854775807 reservation
207618048 used_reservation 0 child_reservations 207618048 parent:
<ReservationTracker>: reservation_limit 209715200 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
<ReservationTracker>: reservation_limit 10952163328 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
NULL}
0 out of 0 pinned pages:
0 out of 0 dirty unpinned pages:
0 out of 0 in flight write pages: }
@ 0xc9187a
@ 0x16c71eb
@ 0x16c7970
@ 0x16d6be7
@ 0x16d7afa
@ 0x17074ca
@ 0x125f09f
@ 0x126180c
@ 0x11f5256
@ 0x1592b31
@ 0x1593faa
@ 0x1e1cfe1
@ 0x7f5efac496b9
@ 0x7f5ef76b64dc
I0327 12:57:41.550426 89970 *query-state.cc:957*]
30408ca68c884cb5:cd01ce9900000004] *Instance completed.*
instance_id=30408ca68c884cb5:cd01ce9900000004 #in-flight=0
*status=PARTITIONED_HASH_JOIN_REPARTITION_FAILS: Cannot perform hash join at
node with id 2.* Repartitioning did not reduce the size of a spilled partition.
Repartitioning level 2. Number of rows 1500000:
PhjBuilder op=INNER_JOIN is_separate_build=0 num_probe_threads=1
state=RepartitioningProbe Hash partitions: 16:
Hash partition 0 <Partition>: ptr=0x7f72d830 id=32 Closed
Hash partition 1 <Partition>: ptr=0x1204af5c0 id=33 Closed
Hash partition 2 <Partition>: ptr=0x1b937d1a0 id=34 Closed
Hash partition 3 <Partition>: ptr=0x1831ab740 id=35 Closed
Hash partition 4 <Partition>: ptr=0x1831abd70 id=36 Closed
Hash partition 5 <Partition>: ptr=0x1b937c360 id=37 Closed
Hash partition 6 <Partition>: ptr=0x7f72da10 id=38 Closed
Hash partition 7 <Partition>: ptr=0x45f098c0 id=39 Closed
Hash partition 8 <Partition>: ptr=0x45f09f20 id=40 Closed
Hash partition 9 <Partition>: ptr=0x7f72d950 id=41 Closed
Hash partition 10 <Partition>: ptr=0x10cc6a840 id=42 Closed
Hash partition 11 <Partition>: ptr=0x10cc6bc20 id=43 Closed
Hash partition 12 <Partition>: ptr=0x10cc6a390 id=44 Closed
Hash partition 13 <Partition>: ptr=0x10cc6ab40 id=45 Spilled
Build Rows: 1500000 (Bytes pinned: 0)
Spilled Probe Rows: 0
Hash partition 14 <Partition>: ptr=0x10cc6b8f0 id=46 Closed
Hash partition 15 <Partition>: ptr=0x6058cab0 id=47 Closed
Spilled partitions: 3:
Spilled partition 0 <Partition>: ptr=0x14c8985a0 id=0 Spilled
Build Rows: 1928409 (Bytes pinned: 0)
Spilled Probe Rows: 2
Spilled partition 1 <Partition>: ptr=0x14c898090 id=7 Spilled
Build Rows: 1929115 (Bytes pinned: 0)
Spilled Probe Rows: 2
Spilled partition 2 <Partition>: ptr=0x61b86090 id=29 Closed
buffer_pool_client=<BufferPool::Client> 0xfe3cb58 internal state:
{<BufferPool::Client> 0x8cef180 name: HASH_JOIN_NODE id=2 ptr=0xfe3ca00
write_status: buffers allocated 0 num_pages: 19068 pinned_bytes: 0
dirty_unpinned_bytes: 0 in_flight_write_bytes: 0 reservation:
{<ReservationTracker>: reservation_limit 9223372036854775807 reservation
206569472 used_reservation 0 child_reservations 1048576 parent:
<ReservationTracker>: reservation_limit 9223372036854775807 reservation
207618048 used_reservation 0 child_reservations 207618048 parent:
<ReservationTracker>: reservation_limit 209715200 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
<ReservationTracker>: reservation_limit 10952163328 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
NULL}
0 out of 0 pinned pages:
0 out of 0 dirty unpinned pages:
0 out of 0 in flight write pages: }
<BufferPool::Client> 0xfe3cb58 internal state: {<BufferPool::Client> 0x8cef180
name: HASH_JOIN_NODE id=2 ptr=0xfe3ca00 write_status: buffers allocated 0
num_pages: 19068 pinned_bytes: 0 dirty_unpinned_bytes: 0 in_flight_write_bytes:
0 reservation: {<ReservationTracker>: reservation_limit 9223372036854775807
reservation 206569472 used_reservation 0 child_reservations 1048576 parent:
<ReservationTracker>: reservation_limit 9223372036854775807 reservation
207618048 used_reservation 0 child_reservations 207618048 parent:
<ReservationTracker>: reservation_limit 209715200 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
<ReservationTracker>: reservation_limit 10952163328 reservation 209715200
used_reservation 0 child_reservations 209715200 parent:
NULL}
0 out of 0 pinned pages:
0 out of 0 dirty unpinned pages:
0 out of 0 in flight write pages: }
I0327 12:57:41.559296 89964 *query-state.cc:468*]
30408ca68c884cb5:cd01ce9900000000] UpdateBackendExecState(): *last report for
30408ca68c884cb5*:cd01ce9900000000
I0327 12:57:41.826570 89964 query-exec-mgr.cc:213]
30408ca68c884cb5:cd01ce9900000000] ReleaseQueryState(): deleted
query_id=30408ca68c884cb5:cd01ce9900000000
> Apparent hang or crash in
> TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action
> -------------------------------------------------------------------------------------------
>
> Key: IMPALA-10339
> URL: https://issues.apache.org/jira/browse/IMPALA-10339
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Tim Armstrong
> Assignee: Wenzhe Zhou
> Priority: Blocker
> Labels: broken-build, flaky, hang
>
> Release build with this commit as the tip:
> {noformat}
> commit 9400e9b17b13f613defb6d7b9deb471256b1d95c (CDH/cdpd-master-staging)
> Author: wzhou-code <[email protected]>
> Date: Thu Oct 29 22:32:03 2020 -0700
> IMPALA-10305: Sync Kudu's FIPS compliant changes
>
> {noformat}
> {noformat}
> Regression
> query_test.test_spilling.TestSpillingNoDebugActionDimensions.test_spilling_no_debug_action[protocol:
> beeswax | exec_option: {'mt_dop': 0, 'default_spillable_buffer_size': '64k'}
> | table_format: parquet/none] (from pytest)
> Failing for the past 1 build (Since Failed#100 )
> Took 1 hr 59 min.
> add description
> Error Message
> query_test/test_spilling.py:134: in test_spilling_no_debug_action
> self.run_test_case('QueryTest/spilling-no-debug-action', vector)
> common/impala_test_suite.py:668: in run_test_case
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions (expected_str,
> actual_str) E AssertionError: Unexpected exception string. Expected:
> row_regex:.*Cannot perform hash join at node with id .*. Repartitioning did
> not reduce the size of a spilled partition.* E Not found in actual: Timeout
> >7200s
> Stacktrace
> query_test/test_spilling.py:134: in test_spilling_no_debug_action
> self.run_test_case('QueryTest/spilling-no-debug-action', vector)
> common/impala_test_suite.py:668: in run_test_case
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions
> (expected_str, actual_str)
> E AssertionError: Unexpected exception string. Expected: row_regex:.*Cannot
> perform hash join at node with id .*. Repartitioning did not reduce the size
> of a spilled partition.*
> E Not found in actual: Timeout >7200s
> Standard Error
> SET
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> -- executing against localhost:21000
> use tpch_parquet;
> -- 2020-11-11 23:12:04,319 INFO MainThread: Started query
> c740c1c66d9679a9:6a40f16100000000
> SET
> client_identifier=query_test/test_spilling.py::TestSpillingNoDebugActionDimensions::()::test_spilling_no_debug_action[protocol:beeswax|exec_option:{'mt_dop':0;'default_spillable_buffer_size':'64k'}|table_format:parquet/none];
> SET mt_dop=0;
> SET default_spillable_buffer_size=64k;
> -- 2020-11-11 23:12:04,320 INFO MainThread: Loading query test file:
> /data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
> -- 2020-11-11 23:12:04,323 INFO MainThread: Starting new HTTP connection
> (1): localhost
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:04,377 INFO MainThread: Started query
> c044afcf5ae44df9:a2e2e7c600000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a, lineitem b
> where
> a.l_partkey = 1 and
> a.l_orderkey = b.l_orderkey;
> -- 2020-11-11 23:12:04,385 INFO MainThread: Started query
> 314c019cd252f322:2411bc7600000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:05,199 INFO MainThread: Started query
> 80424e68922c30f9:b2144dff00000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:05,207 INFO MainThread: Started query
> 2a4c1f4b26ea52da:4339f3ff00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> lineitem a
> where
> a.l_partkey not in (select l_partkey from lineitem where l_partkey > 10)
> and a.l_partkey < 1000;
> -- 2020-11-11 23:12:05,215 INFO MainThread: Started query
> f845afd00a569446:79c5054a00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:07,507 INFO MainThread: Started query
> ee4f8a685928e7ef:830d965100000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:07,512 INFO MainThread: Started query
> 654a6eced9594931:cc68289a00000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:07,519 INFO MainThread: Started query
> 7a41a406b446e082:b5e3bf2f00000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:08,549 INFO MainThread: Started query
> 1445a2833895ee1a:d136681000000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:08,554 INFO MainThread: Started query
> 4149ef276c643426:a285156400000000
> -- executing against localhost:21000
> select straight_join count(*)
> from
> supplier right outer join lineitem on s_suppkey = l_suppkey
> where s_acctbal > 0 and s_acctbal < 10;
> -- 2020-11-11 23:12:08,562 INFO MainThread: Started query
> 58427e99dadb6ca9:0d184f2700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:09,586 INFO MainThread: Started query
> 1d498d1d50b86ed3:616ba0e600000000
> -- executing against localhost:21000
> set debug_action="-1:OPEN:[email protected]";
> -- 2020-11-11 23:12:09,592 INFO MainThread: Started query
> 4543a32357f9b4cb:e3dbfab800000000
> -- executing against localhost:21000
> with x as (select * from supplier limit 10)
> select straight_join count(*)
> from
> x right anti join lineitem on s_suppkey + 100 = l_suppkey;
> -- 2020-11-11 23:12:09,599 INFO MainThread: Started query
> 9e4ba8fd3dbbf2bd:d73094e700000000
> -- executing against localhost:21000
> SET DEBUG_ACTION="";
> -- 2020-11-11 23:12:10,320 INFO MainThread: Started query
> 62471da179150bd1:ffa7dd7000000000
> -- executing against localhost:21000
> set mem_limit=75m;
> -- 2020-11-11 23:12:10,326 INFO MainThread: Started query
> 8445afe55c819986:0915d3f400000000
> -- executing against localhost:21000
> select l_orderkey, group_concat(repeat(l_comment, 10)) comments
> from lineitem
> group by l_orderkey
> order by comments desc
> limit 5;
> -- 2020-11-11 23:12:10,332 INFO MainThread: Started query
> d84ba6a837b98c8d:c7c73fd500000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,638 INFO MainThread: Started query
> 58476b9f3b524901:bed8346900000000
> -- executing against localhost:21000
> set topn_bytes_limit=-1;
> -- 2020-11-11 23:12:10,642 INFO MainThread: Started query
> 2742e503cfca610e:a4029e4800000000
> -- executing against localhost:21000
> set mem_limit=100m;
> -- 2020-11-11 23:12:10,648 INFO MainThread: Started query
> 9a4f67b8c8a95465:67f8143100000000
> -- executing against localhost:21000
> select *
> from lineitem
> order by l_orderkey desc
> limit 6000000;
> -- 2020-11-11 23:12:10,654 INFO MainThread: Started query
> 7147f6f62984fdb2:614cab6a00000000
> -- executing against localhost:21000
> SET TOPN_BYTES_LIMIT="536870912";
> -- 2020-11-11 23:12:10,859 INFO MainThread: Started query
> 4642e316a7f90110:c8344b0d00000000
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-11 23:12:10,863 INFO MainThread: Started query
> 414724e3a0a7f290:d1a0e69000000000
> -- executing against localhost:21000
> set mem_limit=250m;
> -- 2020-11-11 23:12:10,867 INFO MainThread: Started query
> 854c0c0d1569c7f5:3635e5ae00000000
> -- executing against localhost:21000
> select straight_join *
> from supplier join /* +broadcast */ lineitem on s_suppkey = l_linenumber
> order by l_tax desc
> limit 5;
> -- 2020-11-11 23:12:10,873 INFO MainThread: Started query
> de444e75ef44ee0b:a1614bfe00000000
> ~~~~~~~~~~~~~~~~~~~~~ Stack of <unknown> (140237730514688)
> ~~~~~~~~~~~~~~~~~~~~~
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 277, in _perform_spawn
> reply.run()
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 213, in run
> self._result = func(*args, **kwargs)
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 954, in _thread_receiver
> msg = Message.from_io(io)
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 418, in from_io
> header = io.read(9) # type 1, channel 4, payload 4
> File
> "/data/jenkins/workspace/impala-cdpd-master-staging-exhaustive-release/repos/Impala/infra/python/env-gcc7.5.0/lib/python2.7/site-packages/execnet/gateway_base.py",
> line 386, in read
> data = self._read(numbytes-len(buf))
> -- executing against localhost:21000
> SET MEM_LIMIT="0";
> -- 2020-11-12 01:12:03,717 INFO MainThread: Started query
> 8d4eac3249e55996:2b2053a700000000
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]