[
https://issues.apache.org/jira/browse/IMPALA-13107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenzhe Zhou resolved IMPALA-13107.
----------------------------------
Fix Version/s: Impala 4.5.0
Resolution: Fixed
> Invalid TExecPlanFragmentInfo received by executor with instance number as 0
> ----------------------------------------------------------------------------
>
> Key: IMPALA-13107
> URL: https://issues.apache.org/jira/browse/IMPALA-13107
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Assignee: Wenzhe Zhou
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> In a customer-reported case, executors received a TExecPlanFragmentInfo with
> the number of instances equal to 0, which caused the Impala daemon to crash.
> Here are the log messages collected on the Impala executors:
> {code:java}
> impalad.executor.net.impala.log.INFO.20240522-160138.197583:I0523
> 00:59:16.892853 199528 control-service.cc:148]
> 624c47e9264ebb62:5aa89af300000000] ExecQueryFInstances():
> query_id=624c47e9264ebb62:5aa89af300000000 coord=coordinator.net:27000
> #instances=0
> ......
> I0523 00:59:19.306522 199185 kMinidump in thread
> [1890723]query-state-624c47e9264ebb62:5aa89af300000000 running query
> 624c47e9264ebb62:5aa89af300000000, fragment instance
> 0000000000000000:0000000000000000
> Wrote minidump to
> /var/log/impala-minidumps/impalad/021b06ea-1627-4c69-9f27858a-f3cd9026.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00000000012ff9d9, pid=197583, tid=0x00007eefc98a0700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_381) (build 1.8.0_381-b09)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.381-b09 mixed mode
> linux-amd64 )
> # Problematic frame:
> # C [impalad+0xeff9d9]
> impala::FragmentState::FragmentState(impala::QueryState*,
> impala::TPlanFragment const&, impala::PlanFragmentCtxPB const&)+0xf9
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> {code}
> From the collected profiles, no fragment in the corresponding query plan had
> 0 instances, so the coordinator should not have sent fragments to an executor
> with the number of instances as 0. The executor log files showed many KRPC
> errors around the time the invalid TExecPlanFragmentInfo was received. It
> seems the KRPC messages were truncated due to KRPC failures, and the
> truncation might not have caused a Thrift deserialization error. The invalid
> TExecPlanFragmentInfo caused the Impala daemon to crash with the following
> stack trace when the query was started on the executor.
> {code:java}
> #0 SubstituteArg (value=..., this=0x7f86cec79d30) at
> ../gutil/strings/substitute.h:79
> #1 impala::FragmentState::FragmentState (this=0x35c78f40,
> query_state=0x7972db00, fragment=...,
> fragment_ctx=<error reading variable: Cannot access memory at address
> 0x35c78f88>) at fragment-state.cc:143
> #2 0x00000000013019aa in impala::FragmentState::CreateFragmentStateMap
> (fragment_info=..., exec_request=...,
> state=state@entry=0x7972db00, fragment_map=...) at fragment-state.cc:47
> #3 0x0000000001292d71 in impala::QueryState::StartFInstances
> (this=this@entry=0x7972db00) at query-state.cc:820
> #4 0x0000000001284810 in impala::QueryExecMgr::ExecuteQueryHelper
> (this=0x11943b00, qs=0x7972db00)
> at query-exec-mgr.cc:162
> #5 0x0000000001752915 in operator() (this=0x7f86cec7ab40)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/function/function_template.hpp:770
> #6 impala::Thread::SuperviseThread(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>
> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*,
> impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=...,
> functor=...,
> parent_thread_info=<optimized out>, thread_started=0x7f87b7b9acb0) at
> thread.cc:360
> #7 0x0000000001753c9b in operator()<void (*)(const
> std::__cxx11::basic_string<char>&, const std::__cxx11::basic_string<char>&,
> boost::function<void()>, const impala::ThreadDebugInfo*, impala::Promise<long
> int>*), boost::_bi::list0> (
> a=<synthetic pointer>, f=@0x1f66f3b8: <error reading variable>,
> this=0x1f66f3c0)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:531
> #8 operator() (this=0x1f66f3b8)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222
> #9 boost::detail::thread_data<boost::_bi::bind_t<void, void
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&, boost::function<void
> ()>, impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > >, boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > >
> >::run() (this=0x1f66f200)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116
> #10 0x0000000001fb4322 in thread_proxy ()
> #11 0x00007f98af288ea5 in start_thread () from /lib64/libpthread.so.0
> #12 0x00007f98ac2dfb0d in gnu_dev_makedev () from /lib64/libc.so.6
> #13 0x0000000000000000 in ?? ()
> {code}
> Note that this issue happened when extra load was added to the Impala
> cluster, which caused a large number of RPC failures.
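
The description above suggests that an executor could defend itself by rejecting a fragment-info payload with zero instances before building any per-fragment state. The following is a minimal sketch of such a check, not the actual patch for this issue. The types `FragmentInfoSketch` and `ValidationResult` and the function `ValidateFragmentInfo` are hypothetical stand-ins invented for illustration; they are not Impala's real Thrift or protobuf definitions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the deserialized payload. This is NOT Impala's
// real TExecPlanFragmentInfo; the field names are assumptions.
struct FragmentInfoSketch {
  std::vector<int> fragments;           // one entry per plan fragment
  std::vector<int> fragment_instances;  // one entry per fragment instance
};

// Hypothetical result type: ok == false carries a human-readable reason.
struct ValidationResult {
  bool ok;
  std::string error;
};

// Rejects a request that advertises zero instances, or whose payload does not
// match the advertised instance count, so the caller can fail the RPC with an
// error instead of constructing fragment state from empty or truncated data.
ValidationResult ValidateFragmentInfo(const FragmentInfoSketch& info,
                                      std::size_t advertised_num_instances) {
  if (advertised_num_instances == 0) {
    return {false, "request advertises 0 fragment instances"};
  }
  if (info.fragments.empty()) {
    return {false, "fragment info contains no fragments"};
  }
  if (info.fragment_instances.size() != advertised_num_instances) {
    return {false, "instance count does not match deserialized payload"};
  }
  return {true, ""};
}
```

In this sketch the check would run right after deserialization and before any per-fragment state is constructed, so a truncated message that still deserializes cleanly results in a failed RPC rather than a crash.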