[
https://issues.apache.org/jira/browse/IMPALA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186680#comment-17186680
]
ASF subversion and git services commented on IMPALA-10050:
----------------------------------------------------------
Commit 3733c4cc2cfb78d7f13463fb1ee9e1c4560d4a3d in impala's branch
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3733c4c ]
IMPALA-10050: Fixed DCHECK error for backend in terminal state.
Recent patch for IMPALA-6788 makes coordinator to cancel inflight
query fragment instances when it receives failure report from one
backend. It's possible the BackendState::Cancel() is called for
one fragment instance before the first execution status report
from its backend is received and processed by the coordinator.
Since the status of BackendState is set as Cancelled after Cancel()
is called, the execution of the fragment instance is treated as
Done in such case so that the status report will NOT be processed.
Hence the backend receives response OK from coordinator even it
sent a report with execution error. This make backend hit DCHECK
error if backend in the terminal state with error.
This patch fixs the issue by making coordinator send CANCELLED
status in the response of status report if the backend status is not
ok and the execution status report is not applied.
Testing:
- The issue could be reproduced by running test_failpoints for about
20 iterations. Verified the fixing by running test_failpoints over
200 iterations without DCHECK failure.
- Passed TestProcessFailures::test_kill_coordinator.
- Psssed TestRPCException::test_state_report_error.
- Passed exhaustive tests.
Change-Id: Iba6a72f98c0f9299c22c58830ec5a643335b966a
Reviewed-on: http://gerrit.cloudera.org:8080/16303
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> DCHECK was hit possibly while executing TestFailpoints::test_failpoints
> -----------------------------------------------------------------------
>
> Key: IMPALA-10050
> URL: https://issues.apache.org/jira/browse/IMPALA-10050
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Attila Jeges
> Assignee: Wenzhe Zhou
> Priority: Blocker
> Labels: broken-build, crash, flaky
> Fix For: Impala 4.0
>
>
> A DCHECK was hit during ASAN core e2e tests. Time-frame suggests that it
> happened while executing TestFailpoints::test_failpoints e2e test.
> {code}
> 10:56:38 TestFailpoints.test_failpoints[protocol: beeswax | table_format:
> avro/snap/block | exec_option: {'batch_size': 0, 'num_nodes': 0,
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False,
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 |
> location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: select 1 from
> alltypessmall a join alltypessmall b on a.id = b.id]
> 10:56:38 failure/test_failpoints.py:128: in test_failpoints
> 10:56:38 self.execute_query(query, vector.get_value('exec_option'))
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:811:
> in wrapper
> 10:56:38 return function(*args, **kwargs)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:843:
> in execute_query
> 10:56:38 return self.__execute_query(self.client, query, query_options)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:909:
> in __execute_query
> 10:56:38 return impalad_client.execute(query, user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_connection.py:205:
> in execute
> 10:56:38 return self.__beeswax_client.execute(sql_stmt, user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:187:
> in execute
> 10:56:38 handle = self.__execute_query(query_string.strip(), user=user)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:365:
> in __execute_query
> 10:56:38 self.wait_for_finished(handle)
> 10:56:38
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:386:
> in wait_for_finished
> 10:56:38 raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 10:56:38 E ImpalaBeeswaxException: ImpalaBeeswaxException:
> 10:56:38 E Query aborted:RPC from 127.0.0.1:27000 to 127.0.0.1:27002 failed
> 10:56:38 E TransmitData() to 127.0.0.1:27002 failed: Network error: Client
> connection negotiation failed: client connection to 127.0.0.1:27002: connect:
> Connection refused (error 111)
> {code}
> Impalad log:
> {code}
> Log file created at: 2020/08/05 01:52:56
> Running on machine:
> impala-ec2-centos74-r5-4xlarge-ondemand-017c.vpc.cloudera.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0805 01:52:56.979769 17313 query-state.cc:803]
> 3941a3d92a71e242:15c963f300000000] Check failed: is_cancelled_.Load() == 1 (0
> vs. 1)
> {code}
> Stack trace
> {code}
> Thread 368 (crashed)
> 0 libc-2.17.so + 0x351f7
> rax = 0x0000000000000000 rdx = 0x0000000000000006
> rcx = 0xffffffffffffffff rbx = 0x0000000000000004
> rsi = 0x00000000000043a1 rdi = 0x00000000000037e4
> rbp = 0x00007efcd4c53080 rsp = 0x00007efcd4c52d08
> r8 = 0x0000000000000000 r9 = 0x00007efcd4c52b80
> r10 = 0x0000000000000008 r11 = 0x0000000000000206
> r12 = 0x00000000093de7c0 r13 = 0x0000000000000086
> r14 = 0x00000000093de7c4 r15 = 0x00000000093d6de0
> rip = 0x00007f05c9d231f7
> Found by: given as instruction pointer in context
> 1 impalad!google::LogMessage::Flush() + 0x1eb
> rbp = 0x00007efcd4c53250 rsp = 0x00007efcd4c53090
> rip = 0x0000000005727e3b
> Found by: previous frame's frame pointer
> 2 impalad!google::LogMessageFatal::~LogMessageFatal() + 0x9
> rbx = 0x00007efcd4c532a0 rbp = 0x00007efcd4c53310
> rsp = 0x00007efcd4c53130 r12 = 0x00000fe01a982628
> r13 = 0x000061d000da0a6c r14 = 0x00007efcd4c53250
> r15 = 0x00007efcd4c53270 rip = 0x000000000572ba39
> Found by: call frame info
> 3 impalad!impala::QueryState::MonitorFInstances() [query-state.cc : 803 +
> 0x45]
> rbx = 0x00007efcd4c532a0 rbp = 0x00007efcd4c53310
> rsp = 0x00007efcd4c53140 r12 = 0x00000fe01a982628
> r13 = 0x000061d000da0a6c r14 = 0x00007efcd4c53250
> r15 = 0x00007efcd4c53270 rip = 0x00000000024c1713
> Found by: call frame info
> 4 impalad!impala::QueryExecMgr::ExecuteQueryHelper(impala::QueryState*)
> [query-exec-mgr.cc : 161 + 0x8]
> rbx = 0x000061d000da0280 rbp = 0x00007efcd4c53330
> rsp = 0x00007efcd4c53320 r12 = 0x00007efcd4c53360
> r13 = 0x00007efcd4c53340 r14 = 0x00000fdf9a98a668
> r15 = 0x0000000000000000 rip = 0x00000000024a767d
> Found by: call frame info
> 5 impalad!boost::_bi::bind_t<void, boost::_mfi::mf1<void,
> impala::QueryExecMgr, impala::QueryState*>,
> boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>,
> boost::_bi::value<impala::QueryState*> > >::operator()()
> [bind.hpp : 1222 + 0xe]
> rbx = 0x00007efcd4c53380 rbp = 0x00007efcd4c533f0
> rsp = 0x00007efcd4c53340 r12 = 0x00007efcd4c53360
> r13 = 0x00007efcd4c53340 r14 = 0x00000fdf9a98a668
> r15 = 0x0000000000000000 rip = 0x00000000024b0f5c
> Found by: call frame info
> 6 impalad!boost::function0<void>::operator()() const
> [function_template.hpp : 770 + 0x5]
> rbx = 0x00007efcd4c53440 rbp = 0x00007efcd4c534b0
> rsp = 0x00007efcd4c53400 r12 = 0x00000fdf9a98a680
> r13 = 0x00007efcd4c53400 r14 = 0x0000000000000000
> r15 = 0x00007efcd4c53a88 rip = 0x000000000221c277
> Found by: call frame info
> 7
> impalad!impala::Thread::SuperviseThread(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*) [thread.cc : 360 + 0x9]
> rbx = 0x00007efcd4c539a0 rbp = 0x00007efcd4c53a50
> rsp = 0x00007efcd4c534c0 r12 = 0x00007efcd4c534c0
> r13 = 0x00007efcd4c536c0 r14 = 0x00007efcd4c535f0
> r15 = 0x00000fe01a982698 rip = 0x0000000002c005da
> Found by: call frame info
> 8 impalad!void
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*>
> >::operator()<void (*)(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list0>(boost::_bi::type<void>, void
> (*&)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*), boost::_bi::list0&, int) [bind.hpp : 531 +
> 0x12]
> rbx = 0x00007efcd4c53ac0 rbp = 0x00007efcd4c53b50
> rsp = 0x00007efcd4c53a60 r12 = 0x00006160006a73a8
> r13 = 0x00007efcd4c53b80 r14 = 0x00007efcd4c53a60
> r15 = 0x00007efd732796c0 rip = 0x0000000002c0ba19
> Found by: call frame info
> 9 impalad!boost::_bi::bind_t<void, void
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, boost::function<void ()>,
> impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*>
> > >::operator()() [bind.hpp : 1222 + 0xe]
> rbx = 0x00007efcd4c53ba0 rbp = 0x00007efcd4c53c10
> rsp = 0x00007efcd4c53b60 r12 = 0x00007efcd4c53b80
> r13 = 0x00007efcd4c53b60 r14 = 0x00000fdf9a98a76c
> r15 = 0x0000000000000000 rip = 0x0000000002c0b86c
> Found by: call frame info
> 10 impalad!thread_proxy + 0x72
> rbx = 0x00006160006a7180 rbp = 0x0000603000f98230
> rsp = 0x00007efcd4c53c20 r12 = 0x0000000000000000
> r13 = 0x00007efcd4c549c0 r14 = 0x00007efcd4c54700
> r15 = 0x000000000000002b rip = 0x000000000440b772
> Found by: call frame info
> 11 libpthread-2.17.so + 0x7e25
> rbx = 0x0000000000000000 rbp = 0x0000000000000000
> rsp = 0x00007efcd4c53c60 r12 = 0x0000000000000000
> r13 = 0x00007efcd4c549c0 r14 = 0x00007efcd4c54700
> r15 = 0x000000000000002b rip = 0x00007f05cd34fe25
> Found by: call frame info
> 12 libc-2.17.so + 0xf834d
> rsp = 0x00007efcd4c53d00
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]