[
https://issues.apache.org/jira/browse/IMPALA-13107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850525#comment-17850525
]
ASF subversion and git services commented on IMPALA-13107:
----------------------------------------------------------
Commit 3e1b10556bc83b0e697b7a2aac411ccad6094563 in impala's branch
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3e1b10556 ]
IMPALA-13107: Don't start query on executor if instance number equals 0
Under poor network conditions, the TExecPlanFragmentInfo in KRPC messages
received by executors can be truncated by KRPC failures, and the
truncation does not necessarily cause a Thrift deserialization error. The
invalid TExecPlanFragmentInfo then causes the Impala daemon to crash.
To avoid the crash, this patch checks the number of instances in the
received TExecPlanFragment on the executor. The query is not started if
the number of instances equals 0. It also adds a DCHECK on the coordinator
side to make sure it does not send a TExecPlanFragment without any instance.
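The check described above can be sketched roughly as follows. This is a
minimal illustration, not the code from the patch: ExecFragmentInfo,
FragmentInstanceCtx, Status, ValidateExecRequest and CheckBeforeSend are
hypothetical stand-ins for the real Impala types and functions, and a
plain assert stands in for DCHECK.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real request types. The field names are
// illustrative and are not taken from the Impala source.
struct FragmentInstanceCtx {
  int fragment_idx = 0;
};

struct ExecFragmentInfo {
  std::vector<int> fragments;                      // one per plan fragment
  std::vector<FragmentInstanceCtx> instance_ctxs;  // one per instance
};

// Simplified status type (assumption; the real code uses impala::Status).
struct Status {
  bool ok;
  std::string msg;
};

Status OkStatus() { return Status{true, ""}; }
Status ErrorStatus(const std::string& m) { return Status{false, m}; }

// Executor side: refuse to start a query whose request has no instances,
// instead of letting later code dereference missing instance state.
Status ValidateExecRequest(const ExecFragmentInfo& info) {
  if (info.instance_ctxs.empty()) {
    return ErrorStatus("Invalid request: number of instances is 0");
  }
  return OkStatus();
}

// Coordinator side: debug-only sanity check before sending the request.
void CheckBeforeSend(const ExecFragmentInfo& info) {
  assert(!info.instance_ctxs.empty() && "request has no instances");
}
```

Rejecting the request with an error status on the executor lets the query
fail cleanly, while the coordinator-side assertion is only meant to catch
a logic error in debug builds.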
Testing:
- Passed core tests.
- Passed exhaustive tests in debug build. The new DCHECKs were not
hit.
Change-Id: Ie92ee120f1e9369f8dc2512792a05b7f8be5f007
Reviewed-on: http://gerrit.cloudera.org:8080/21458
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Invalid TExecPlanFragmentInfo received by executor with instance number as 0
> ----------------------------------------------------------------------------
>
> Key: IMPALA-13107
> URL: https://issues.apache.org/jira/browse/IMPALA-13107
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Assignee: Wenzhe Zhou
> Priority: Major
> Fix For: Impala 4.5.0
>
>
> In a customer-reported case, executors received a TExecPlanFragmentInfo
> with an instance count of 0, which caused the Impala daemon to crash. Here
> are the log messages collected on the Impala executors:
> {code:java}
> impalad.executor.net.impala.log.INFO.20240522-160138.197583:I0523
> 00:59:16.892853 199528 control-service.cc:148]
> 624c47e9264ebb62:5aa89af300000000] ExecQueryFInstances():
> query_id=624c47e9264ebb62:5aa89af300000000 coord=coordinator.net:27000
> #instances=0
> ......
> I0523 00:59:19.306522 199185 kMinidump in thread
> [1890723]query-state-624c47e9264ebb62:5aa89af300000000 running query
> 624c47e9264ebb62:5aa89af300000000, fragment instance
> 0000000000000000:0000000000000000
> Wrote minidump to
> /var/log/impala-minidumps/impalad/021b06ea-1627-4c69-9f27858a-f3cd9026.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00000000012ff9d9, pid=197583, tid=0x00007eefc98a0700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_381) (build 1.8.0_381-b09)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.381-b09 mixed mode
> linux-amd64 )
> # Problematic frame:
> # C [impalad+0xeff9d9]
> impala::FragmentState::FragmentState(impala::QueryState*,
> impala::TPlanFragment const&, impala::PlanFragmentCtxPB const&)+0xf9
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> {code}
> The collected profiles showed no fragment with an instance count of 0 in
> the corresponding query plan, so the coordinator should not have sent
> fragments to an executor with 0 instances. Executor log files showed many
> KRPC errors around the time the invalid TExecPlanFragmentInfo was
> received. It seems the KRPC messages were truncated due to KRPC failures,
> but the truncation did not cause a Thrift deserialization error. The
> invalid TExecPlanFragmentInfo caused the Impala daemon to crash with the
> following stack trace when the query was started on the executor.
> {code:java}
> #0 SubstituteArg (value=..., this=0x7f86cec79d30) at
> ../gutil/strings/substitute.h:79
> #1 impala::FragmentState::FragmentState (this=0x35c78f40,
> query_state=0x7972db00, fragment=...,
> fragment_ctx=<error reading variable: Cannot access memory at address
> 0x35c78f88>) at fragment-state.cc:143
> #2 0x00000000013019aa in impala::FragmentState::CreateFragmentStateMap
> (fragment_info=..., exec_request=...,
> state=state@entry=0x7972db00, fragment_map=...) at fragment-state.cc:47
> #3 0x0000000001292d71 in impala::QueryState::StartFInstances
> (this=this@entry=0x7972db00) at query-state.cc:820
> #4 0x0000000001284810 in impala::QueryExecMgr::ExecuteQueryHelper
> (this=0x11943b00, qs=0x7972db00)
> at query-exec-mgr.cc:162
> #5 0x0000000001752915 in operator() (this=0x7f86cec7ab40)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/function/function_template.hpp:770
> #6 impala::Thread::SuperviseThread(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&,
> std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>
> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*,
> impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=...,
> functor=...,
> parent_thread_info=<optimized out>, thread_started=0x7f87b7b9acb0) at
> thread.cc:360
> #7 0x0000000001753c9b in operator()<void (*)(const
> std::__cxx11::basic_string<char>&, const std::__cxx11::basic_string<char>&,
> boost::function<void()>, const impala::ThreadDebugInfo*, impala::Promise<long
> int>*), boost::_bi::list0> (
> a=<synthetic pointer>, f=@0x1f66f3b8: <error reading variable>,
> this=0x1f66f3c0)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:531
> #8 operator() (this=0x1f66f3b8)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222
> #9 boost::detail::thread_data<boost::_bi::bind_t<void, void
> (*)(std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > const&, std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&, boost::function<void
> ()>, impala::ThreadDebugInfo const*, impala::Promise<long,
> (impala::PromiseMode)0>*),
> boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> > >, boost::_bi::value<boost::function<void ()> >,
> boost::_bi::value<impala::ThreadDebugInfo*>,
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > >
> >::run() (this=0x1f66f200)
> at
> ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116
> #10 0x0000000001fb4322 in thread_proxy ()
> #11 0x00007f98af288ea5 in start_thread () from /lib64/libpthread.so.0
> #12 0x00007f98ac2dfb0d in gnu_dev_makedev () from /lib64/libc.so.6
> #13 0x0000000000000000 in ?? ()
> {code}
> Note that this issue occurred when extra load was added to the Impala
> cluster, which caused a large number of RPC failures.