[ 
https://issues.apache.org/jira/browse/IMPALA-13107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou resolved IMPALA-13107.
----------------------------------
    Fix Version/s: Impala 4.5.0
       Resolution: Fixed

> Invalid TExecPlanFragmentInfo received by executor with instance number as 0
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-13107
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13107
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> In a customer-reported case, executors received a TExecPlanFragmentInfo 
> whose instance count was 0, which caused the Impala daemon to crash. Here 
> are log messages collected on the Impala executors:
> {code:java}
> impalad.executor.net.impala.log.INFO.20240522-160138.197583:I0523 
> 00:59:16.892853 199528 control-service.cc:148] 
> 624c47e9264ebb62:5aa89af300000000] ExecQueryFInstances(): 
> query_id=624c47e9264ebb62:5aa89af300000000 coord=coordinator.net:27000 
> #instances=0
> ......
> I0523 00:59:19.306522 199185 kMinidump in thread 
> [1890723]query-state-624c47e9264ebb62:5aa89af300000000 running query 
> 624c47e9264ebb62:5aa89af300000000, fragment instance 
> 0000000000000000:0000000000000000
> Wrote minidump to 
> /var/log/impala-minidumps/impalad/021b06ea-1627-4c69-9f27858a-f3cd9026.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00000000012ff9d9, pid=197583, tid=0x00007eefc98a0700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_381) (build 1.8.0_381-b09)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.381-b09 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # C  [impalad+0xeff9d9]  
> impala::FragmentState::FragmentState(impala::QueryState*, 
> impala::TPlanFragment const&, impala::PlanFragmentCtxPB const&)+0xf9
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> {code}
> From the collected profiles, there was no fragment with an instance count 
> of 0 in the corresponding query plan, so the coordinator should not have 
> sent the executor any fragments with zero instances. Executor log files 
> showed many KRPC errors around the time the invalid TExecPlanFragmentInfo 
> was received. It appears the KRPC messages were truncated due to KRPC 
> failures, although truncation alone might not cause a Thrift 
> deserialization error. The invalid TExecPlanFragmentInfo caused the Impala 
> daemon to crash with the following stack trace when the query was started 
> on the executor.
> {code:java}
> #0  SubstituteArg (value=..., this=0x7f86cec79d30) at 
> ../gutil/strings/substitute.h:79
> #1  impala::FragmentState::FragmentState (this=0x35c78f40, 
> query_state=0x7972db00, fragment=..., 
>     fragment_ctx=<error reading variable: Cannot access memory at address 
> 0x35c78f88>) at fragment-state.cc:143
> #2  0x00000000013019aa in impala::FragmentState::CreateFragmentStateMap 
> (fragment_info=..., exec_request=..., 
>     state=state@entry=0x7972db00, fragment_map=...) at fragment-state.cc:47
> #3  0x0000000001292d71 in impala::QueryState::StartFInstances 
> (this=this@entry=0x7972db00) at query-state.cc:820
> #4  0x0000000001284810 in impala::QueryExecMgr::ExecuteQueryHelper 
> (this=0x11943b00, qs=0x7972db00)
>     at query-exec-mgr.cc:162
> #5  0x0000000001752915 in operator() (this=0x7f86cec7ab40)
>     at 
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/function/function_template.hpp:770
> #6  impala::Thread::SuperviseThread(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> 
> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*, 
> impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=..., 
> functor=..., 
>     parent_thread_info=<optimized out>, thread_started=0x7f87b7b9acb0) at 
> thread.cc:360
> #7  0x0000000001753c9b in operator()<void (*)(const 
> std::__cxx11::basic_string<char>&, const std::__cxx11::basic_string<char>&, 
> boost::function<void()>, const impala::ThreadDebugInfo*, impala::Promise<long 
> int>*), boost::_bi::list0> (
>     a=<synthetic pointer>, f=@0x1f66f3b8: <error reading variable>, 
> this=0x1f66f3c0)
>     at 
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:531
> #8  operator() (this=0x1f66f3b8)
>     at 
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222
> #9  boost::detail::thread_data<boost::_bi::bind_t<void, void 
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, boost::function<void 
> ()>, impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*), 
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > >, 
> boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > >, boost::_bi::value<boost::function<void ()> >, 
> boost::_bi::value<impala::ThreadDebugInfo*>, 
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > 
> >::run() (this=0x1f66f200)
>     at 
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116
> #10 0x0000000001fb4322 in thread_proxy ()
> #11 0x00007f98af288ea5 in start_thread () from /lib64/libpthread.so.0
> #12 0x00007f98ac2dfb0d in gnu_dev_makedev () from /lib64/libc.so.6
> #13 0x0000000000000000 in ?? ()
> {code}
> Note that this issue happened when extra load was added to the Impala 
> cluster, which caused a large number of RPC failures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
