[ 
https://issues.apache.org/jira/browse/IMPALA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180309#comment-17180309
 ] 

Wenzhe Zhou commented on IMPALA-10050:
--------------------------------------

The issue could be re-produced by running following script for a few hours:

    #!/bin/bash

    set -euo pipefail

    for iter in `seq 1 100`; do
         ${IMPALA_HOME}/bin/impala-py.test 
tests/failure/test_failpoints.py::TestFailpoints::test_failpoints
    done

Added more debug messages in the code, and found that when executor report 
status with error, it may receive response with status 'ok' occasionally. If 
the executor is in the terminal state and receive response 'ok' for report with 
error, it will not call "Cancel()" hence hit the DCHECK error. The root cause 
is the bug in coordinator, not in executor.

The patch [https://gerrit.cloudera.org/#/c/16215/] only apply to executor so 
this issue was not introduced by the patch. Ran the above script with the code 
rolled back before patch [https://gerrit.cloudera.org/#/c/16215/], the issue 
still could be re-produced. 

Ran more tests with the script and confirmed that this issue was introduced by 
patch [https://gerrit.cloudera.org/#/c/16192/] 
([IMPALA-6788|http://issues.apache.org/jira/browse/IMPALA-6788]: Abort 
ExecFInstance() RPC loop early after query failure). 

 

  

 

> DCHECK was hit possibly while executing TestFailpoints::test_failpoints
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-10050
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10050
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Attila Jeges
>            Assignee: Wenzhe Zhou
>            Priority: Blocker
>              Labels: broken-build, crash, flaky
>             Fix For: Impala 4.0
>
>
> A DCHECK was hit during  ASAN core e2e tests. Time-frame suggests that it 
> happened while executing TestFailpoints::test_failpoints e2e test.
> {code}
> 10:56:38  TestFailpoints.test_failpoints[protocol: beeswax | table_format: 
> avro/snap/block | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | 
> location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: select 1 from 
> alltypessmall a join alltypessmall b on a.id = b.id] 
> 10:56:38 failure/test_failpoints.py:128: in test_failpoints
> 10:56:38     self.execute_query(query, vector.get_value('exec_option'))
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:811:
>  in wrapper
> 10:56:38     return function(*args, **kwargs)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:843:
>  in execute_query
> 10:56:38     return self.__execute_query(self.client, query, query_options)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_test_suite.py:909:
>  in __execute_query
> 10:56:38     return impalad_client.execute(query, user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/common/impala_connection.py:205:
>  in execute
> 10:56:38     return self.__beeswax_client.execute(sql_stmt, user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:187:
>  in execute
> 10:56:38     handle = self.__execute_query(query_string.strip(), user=user)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:365:
>  in __execute_query
> 10:56:38     self.wait_for_finished(handle)
> 10:56:38 
> /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/beeswax/impala_beeswax.py:386:
>  in wait_for_finished
> 10:56:38     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 10:56:38 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 10:56:38 E    Query aborted:RPC from 127.0.0.1:27000 to 127.0.0.1:27002 failed
> 10:56:38 E   TransmitData() to 127.0.0.1:27002 failed: Network error: Client 
> connection negotiation failed: client connection to 127.0.0.1:27002: connect: 
> Connection refused (error 111)
> {code}
> Impalad log:
> {code}
> Log file created at: 2020/08/05 01:52:56
> Running on machine: 
> impala-ec2-centos74-r5-4xlarge-ondemand-017c.vpc.cloudera.com
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0805 01:52:56.979769 17313 query-state.cc:803] 
> 3941a3d92a71e242:15c963f300000000] Check failed: is_cancelled_.Load() == 1 (0 
> vs. 1) 
> {code}
> Stack trace
> {code}
> Thread 368 (crashed)
>  0  libc-2.17.so + 0x351f7
>     rax = 0x0000000000000000   rdx = 0x0000000000000006
>     rcx = 0xffffffffffffffff   rbx = 0x0000000000000004
>     rsi = 0x00000000000043a1   rdi = 0x00000000000037e4
>     rbp = 0x00007efcd4c53080   rsp = 0x00007efcd4c52d08
>      r8 = 0x0000000000000000    r9 = 0x00007efcd4c52b80
>     r10 = 0x0000000000000008   r11 = 0x0000000000000206
>     r12 = 0x00000000093de7c0   r13 = 0x0000000000000086
>     r14 = 0x00000000093de7c4   r15 = 0x00000000093d6de0
>     rip = 0x00007f05c9d231f7
>     Found by: given as instruction pointer in context
>  1  impalad!google::LogMessage::Flush() + 0x1eb
>     rbp = 0x00007efcd4c53250   rsp = 0x00007efcd4c53090
>     rip = 0x0000000005727e3b
>     Found by: previous frame's frame pointer
>  2  impalad!google::LogMessageFatal::~LogMessageFatal() + 0x9
>     rbx = 0x00007efcd4c532a0   rbp = 0x00007efcd4c53310
>     rsp = 0x00007efcd4c53130   r12 = 0x00000fe01a982628
>     r13 = 0x000061d000da0a6c   r14 = 0x00007efcd4c53250
>     r15 = 0x00007efcd4c53270   rip = 0x000000000572ba39
>     Found by: call frame info
>  3  impalad!impala::QueryState::MonitorFInstances() [query-state.cc : 803 + 
> 0x45]
>     rbx = 0x00007efcd4c532a0   rbp = 0x00007efcd4c53310
>     rsp = 0x00007efcd4c53140   r12 = 0x00000fe01a982628
>     r13 = 0x000061d000da0a6c   r14 = 0x00007efcd4c53250
>     r15 = 0x00007efcd4c53270   rip = 0x00000000024c1713
>     Found by: call frame info
>  4  impalad!impala::QueryExecMgr::ExecuteQueryHelper(impala::QueryState*) 
> [query-exec-mgr.cc : 161 + 0x8]
>     rbx = 0x000061d000da0280   rbp = 0x00007efcd4c53330
>     rsp = 0x00007efcd4c53320   r12 = 0x00007efcd4c53360
>     r13 = 0x00007efcd4c53340   r14 = 0x00000fdf9a98a668
>     r15 = 0x0000000000000000   rip = 0x00000000024a767d
>     Found by: call frame info
>  5  impalad!boost::_bi::bind_t<void, boost::_mfi::mf1<void, 
> impala::QueryExecMgr, impala::QueryState*>, 
> boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, 
> boost::_bi::value<impala::QueryState*> > >::operator()() 
> [bind.hpp : 1222 + 0xe]
>     rbx = 0x00007efcd4c53380   rbp = 0x00007efcd4c533f0
>     rsp = 0x00007efcd4c53340   r12 = 0x00007efcd4c53360
>     r13 = 0x00007efcd4c53340   r14 = 0x00000fdf9a98a668
>     r15 = 0x0000000000000000   rip = 0x00000000024b0f5c
>     Found by: call frame info
>  6  impalad!boost::function0<void>::operator()() const 
> [function_template.hpp : 770 + 0x5]
>     rbx = 0x00007efcd4c53440   rbp = 0x00007efcd4c534b0
>     rsp = 0x00007efcd4c53400   r12 = 0x00000fdf9a98a680
>     r13 = 0x00007efcd4c53400   r14 = 0x0000000000000000
>     r15 = 0x00007efcd4c53a88   rip = 0x000000000221c277
>     Found by: call frame info
>  7  
> impalad!impala::Thread::SuperviseThread(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, boost::function<void ()>, 
> impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*) [thread.cc : 360 + 0x9]
>     rbx = 0x00007efcd4c539a0   rbp = 0x00007efcd4c53a50
>     rsp = 0x00007efcd4c534c0   r12 = 0x00007efcd4c534c0
>     r13 = 0x00007efcd4c536c0   r14 = 0x00007efcd4c535f0
>     r15 = 0x00000fe01a982698   rip = 0x0000000002c005da
>     Found by: call frame info
>  8  impalad!void 
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > >, 
> boost::_bi::value<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > >, 
> boost::_bi::value<boost::function<void ()> >, 
> boost::_bi::value<impala::ThreadDebugInfo*>, 
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> 
> >::operator()<void (*)(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, boost::function<void ()>, 
> impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*), 
> boost::_bi::list0>(boost::_bi::type<void>, void 
> (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, boost::function<void ()>, 
> impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*), boost::_bi::list0&, int) [bind.hpp : 531 + 
> 0x12]
>     rbx = 0x00007efcd4c53ac0   rbp = 0x00007efcd4c53b50
>     rsp = 0x00007efcd4c53a60   r12 = 0x00006160006a73a8
>     r13 = 0x00007efcd4c53b80   r14 = 0x00007efcd4c53a60
>     r15 = 0x00007efd732796c0   rip = 0x0000000002c0ba19
>     Found by: call frame info
>  9  impalad!boost::_bi::bind_t<void, void 
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, boost::function<void ()>, 
> impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*), 
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > >, 
> boost::_bi::value<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > >, 
> boost::_bi::value<boost::function<void ()> >, 
> boost::_bi::value<impala::ThreadDebugInfo*>, 
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> 
> > >::operator()() [bind.hpp : 1222 + 0xe]
>     rbx = 0x00007efcd4c53ba0   rbp = 0x00007efcd4c53c10
>     rsp = 0x00007efcd4c53b60   r12 = 0x00007efcd4c53b80
>     r13 = 0x00007efcd4c53b60   r14 = 0x00000fdf9a98a76c
>     r15 = 0x0000000000000000   rip = 0x0000000002c0b86c
>     Found by: call frame info
> 10  impalad!thread_proxy + 0x72
>     rbx = 0x00006160006a7180   rbp = 0x0000603000f98230
>     rsp = 0x00007efcd4c53c20   r12 = 0x0000000000000000
>     r13 = 0x00007efcd4c549c0   r14 = 0x00007efcd4c54700
>     r15 = 0x000000000000002b   rip = 0x000000000440b772
>     Found by: call frame info
> 11  libpthread-2.17.so + 0x7e25
>     rbx = 0x0000000000000000   rbp = 0x0000000000000000
>     rsp = 0x00007efcd4c53c60   r12 = 0x0000000000000000
>     r13 = 0x00007efcd4c549c0   r14 = 0x00007efcd4c54700
>     r15 = 0x000000000000002b   rip = 0x00007f05cd34fe25
>     Found by: call frame info
> 12  libc-2.17.so + 0xf834d
>     rsp = 0x00007efcd4c53d00
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to