[
https://issues.apache.org/jira/browse/IMPALA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180309#comment-17180309
]
Wenzhe Zhou commented on IMPALA-10050:
--------------------------------------
The issue could be re-produced by running following script for a few hours:
#!/bin/bash
set -euo pipefail
for iter in `seq 1 100`; do
${IMPALA_HOME}/bin/impala-py.test
tests/failure/test_failpoints.py::TestFailpoints::test_failpoints
done
Added more debug messages in the code, and found that when executor report
status with error, it may receive response with status 'ok' occasionally. If
the executor is in the terminal state and receive response 'ok' for report with
error, it will not call "Cancel()" hence hit the DCHECK error. The root cause
is the bug in coordinator, not in executor.
The patch [https://gerrit.cloudera.org/#/c/16215/] only apply to executor so
this issue was not introduced by the patch. Ran the above script with the code
rolled back before patch [https://gerrit.cloudera.org/#/c/16215/], the issue
still could be re-produced.
Ran more tests with the script and confirmed that this issue was introduced by
patch [https://gerrit.cloudera.org/#/c/16192/]
([IMPALA-6788|http://issues.apache.org/jira/browse/IMPALA-6788]: Abort
ExecFInstance() RPC loop early after query failure).
> DCHECK was hit possibly while executing TestFailpoints::test_failpoints
> -----------------------------------------------------------------------
>
> Key: IMPALA-10050
> URL: https://issues.apache.org/jira/browse/IMPALA-10050
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Attila Jeges
> Assignee: Wenzhe Zhou
> Priority: Blocker
> Labels: broken-build, crash, flaky
> Fix For: Impala 4.0
>
>
> A DCHECK was hit during ASAN core e2e tests. Time-frame suggests that it
> happened while executing TestFailpoints::test_failpoints e2e test.
> {code}
> 10:56:38 TestFailpoints.test_failpoints[protocol: beeswax | table_format:
> avro/snap/block | exec_option: {'batch_size': 0, 'num_nodes': 0,
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False,
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 |
> location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: select 1 from
> alltypessmall a join alltypessmall b on a.id = b.id]
> 10:56:38 failure/test_failpoints.py:128: in test_failpoints
> 10:56:38 self.execute_query(query, vector.get_value('exec_option'))
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:811:
> in wrapper
> 10:56:38 return function(*args, **kwargs)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:843:
> in execute_query
> 10:56:38 return self.__execute_query(self.client, query, query_options)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:909:
> in __execute_query
> 10:56:38 return impalad_client.execute(query, user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_connection.py:205:
> in execute
> 10:56:38 return self.__beeswax_client.execute(sql_stmt, user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:187:
> in execute
> 10:56:38 handle = self.__execute_query(query_string.strip(), user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:365:
> in __execute_query
> 10:56:38 self.wait_for_finished(handle)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:386:
> in wait_for_finished
> 10:56:38 raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 10:56:38 E ImpalaBeeswaxException: ImpalaBeeswaxException:
> 10:56:38 E Query aborted:RPC from 127.0.0.1:27000 to 127.0.0.1:27002 failed
> 10:56:38 E TransmitData() to 127.0.0.1:27002 failed: Network error: Client
> connection negotiation failed: client connection to 127.0.0.1:27002: connect:
> Connection refused (error 111)
> {code}
> Impalad log:
> {code}
> Log file created at: 2020/08/05 01:52:56
> Running on machine:
> impala-ec2-centos74-r5-4xlarge-ondemand-017c.vpc.cloudera.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0805 01:52:56.979769 17313 query-state.cc:803]
> 3941a3d92a71e242:15c963f300000000] Check failed: is_cancelled_.Load() == 1 (0
> vs. 1)
> {code}
> Stack trace
> {code}
> Thread 368 (crashed)
> 0 libc-2.17.so + 0x351f7
> rax = 0x0000000000000000 rdx = 0x0000000000000006
> rcx = 0xffffffffffffffff rbx = 0x0000000000000004
> rsi = 0x00000000000043a1 rdi = 0x00000000000037e4
> rbp = 0x00007efcd4c53080 rsp = 0x00007efcd4c52d08
> r8 = 0x0000000000000000 r9 = 0x00007efcd4c52b80
> r10 = 0x0000000000000008 r11 = 0x0000000000000206
> r12 = 0x00000000093de7c0 r13 = 0x0000000000000086
> r14 = 0x00000000093de7c4 r15 = 0x00000000093d6de0
> rip = 0x00007f05c9d231f7
> Found by: given as instruction pointer in context
> 1 impalad!google::LogMessage::Flush() + 0x1eb
> rbp = 0x00007efcd4c53250 rsp = 0x00007efcd4c53090
> rip = 0x0000000005727e3b
> Found by: previous frame's frame pointer
> 2 impalad!google::LogMessageFatal::~LogMessageFatal() + 0x9
> rbx = 0x00007efcd4c532a0 rbp = 0x00007efcd4c53310
> rsp = 0x00007efcd4c53130 r12 = 0x00000fe01a982628
> r13 = 0x000061d000da0a6c r14 = 0x00007efcd4c53250
> r15 = 0x00007efcd4c53270 rip = 0x000000000572ba39
> Found by: call frame info
> 3 impalad!impala::QueryState::MonitorFInstances() [query-state.cc : 803 +
> 0x45]
> rbx = 0x00007efcd4c532a0 rbp = 0x00007efcd4c53310
> rsp = 0x00007efcd4c53140 r12 = 0x00000fe01a982628
> r13 = 0x000061d000da0a6c r14 = 0x00007efcd4c53250
> r15 = 0x00007efcd4c53270 rip = 0x00000000024c1713
> Found by: call frame info
> 4 impalad!impala::QueryExecMgr::ExecuteQueryHelper(impala::QueryState*)
> [query-exec-mgr.cc : 161 + 0x8]
> rbx = 0x000061d000da0280 rbp = 0x00007efcd4c53330
> rsp = 0x00007efcd4c53320 r12 = 0x00007efcd4c53360
> r13 = 0x00007efcd4c53340 r14 = 0x00000fdf9a98a668
> r15 = 0x0000000000000000 rip = 0x00000000024a767d
> Found by: call frame info
> 5 impalad!boost::_bi::bind_t<void, boost::_mfi::mf1<void,
> impala::QueryExecMgr, impala::QueryState*>,
> boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>,
> boost::_bi::value<impala::QueryState*> > >::operator()()
> [bind.hpp : 1222 + 0xe]
> rbx = 0x00007efcd4c53380 rbp = 0x00007efcd4c533f0
> rsp = 0x00007efcd4c53340 r12 = 0x00007efcd4c53360
> r13 = 0x00007efcd4c53340 r14 = 0x00000fdf9a98a668
> r15 = 0x0000000000000000 rip = 0x00000000024b0f5c
> Found by: call frame info
> 6 impalad!boost::function0<void>::operator()() const
> [function_template.hpp : 770 + 0x5]
> rbx = 0x00007efcd4c53440 rbp = 0x00007efcd4c534b0
> rsp = 0x00007efcd4c53400 r12 = 0x00000fdf9a98a680
> r13 = 0x00007efcd4c53400 r14 = 0x0000000000000000
> r15 = 0x00007efcd4c53a88 rip = 0x000000000221c277
> Found by: call frame info
> 7
> impalad!impala::Thread::SuperviseThread(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*) [thread.cc : 360 + 0x9]
> rbx = 0x00007efcd4c539a0 rbp = 0x00007efcd4c53a50
> rsp = 0x00007efcd4c534c0 r12 = 0x00007efcd4c534c0
> r13 = 0x00007efcd4c536c0 r14 = 0x00007efcd4c535f0
> r15 = 0x00000fe01a982698 rip = 0x0000000002c005da
> Found by: call frame info
> 8 impalad!void
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*>
> >::operator()<void (*)(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list0>(boost::_bi::type<void>, void
> (*&)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*), boost::_bi::list0&, int) [bind.hpp : 531 +
> 0x12]
> rbx = 0x00007efcd4c53ac0 rbp = 0x00007efcd4c53b50
> rsp = 0x00007efcd4c53a60 r12 = 0x00006160006a73a8
> r13 = 0x00007efcd4c53b80 r14 = 0x00007efcd4c53a60
> r15 = 0x00007efd732796c0 rip = 0x0000000002c0ba19
> Found by: call frame info
> 9 impalad!boost::_bi::bind_t<void, void
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*>
> > >::operator()() [bind.hpp : 1222 + 0xe]
> rbx = 0x00007efcd4c53ba0 rbp = 0x00007efcd4c53c10
> rsp = 0x00007efcd4c53b60 r12 = 0x00007efcd4c53b80
> r13 = 0x00007efcd4c53b60 r14 = 0x00000fdf9a98a76c
> r15 = 0x0000000000000000 rip = 0x0000000002c0b86c
> Found by: call frame info
> 10 impalad!thread_proxy + 0x72
> rbx = 0x00006160006a7180 rbp = 0x0000603000f98230
> rsp = 0x00007efcd4c53c20 r12 = 0x0000000000000000
> r13 = 0x00007efcd4c549c0 r14 = 0x00007efcd4c54700
> r15 = 0x000000000000002b rip = 0x000000000440b772
> Found by: call frame info
> 11 libpthread-2.17.so + 0x7e25
> rbx = 0x0000000000000000 rbp = 0x0000000000000000
> rsp = 0x00007efcd4c53c60 r12 = 0x0000000000000000
> r13 = 0x00007efcd4c549c0 r14 = 0x00007efcd4c54700
> r15 = 0x000000000000002b rip = 0x00007f05cd34fe25
> Found by: call frame info
> 12 libc-2.17.so + 0xf834d
> rsp = 0x00007efcd4c53d00
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]