Mostafa Mokhtar created IMPALA-6788:
---------------------------------------
Summary: Query fragments can spend lots of time starting up then
fail right after "starting" all backends
Key: IMPALA-6788
URL: https://issues.apache.org/jira/browse/IMPALA-6788
Project: IMPALA
Issue Type: Sub-task
Components: Distributed Exec
Reporter: Mostafa Mokhtar
Attachments: connect_thread_busy_queries_failing.txt
Logs from a large cluster show that query startup can take a long time, and
that the query is cancelled as soon as startup completes because one of the
intermediate RPCs failed.
It is not clear what the right fix is, since fragments are started
asynchronously. Possibly a timeout?
{code}
I0401 21:25:30.776803 1830900 coordinator.cc:99] Exec() query_id=334cc7dd9758c36c:ec38aeb400000000 stmt=with customer_total_return as
I0401 21:25:30.813993 1830900 coordinator.cc:357] starting execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
I0401 21:29:58.406466 1830900 coordinator.cc:370] started execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
I0401 21:29:58.412132 1830900 coordinator.cc:896] Cancel() query_id=334cc7dd9758c36c:ec38aeb400000000
I0401 21:29:59.188817 1830900 coordinator.cc:906] CancelBackends() query_id=334cc7dd9758c36c:ec38aeb400000000, tried to cancel 643 backends
I0401 21:29:59.189177 1830900 coordinator.cc:1092] Release admission control resources for query_id=334cc7dd9758c36c:ec38aeb400000000
{code}
{code}
I0401 21:23:48.218379 1830386 coordinator.cc:99] Exec() query_id=e44d553b04d47cfb:28f06bb800000000 stmt=with customer_total_return as
I0401 21:23:48.270226 1830386 coordinator.cc:357] starting execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
I0401 21:29:58.402195 1830386 coordinator.cc:370] started execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
I0401 21:29:58.403818 1830386 coordinator.cc:896] Cancel() query_id=e44d553b04d47cfb:28f06bb800000000
I0401 21:29:59.255903 1830386 coordinator.cc:906] CancelBackends() query_id=e44d553b04d47cfb:28f06bb800000000, tried to cancel 639 backends
I0401 21:29:59.256251 1830386 coordinator.cc:1092] Release admission control resources for query_id=e44d553b04d47cfb:28f06bb800000000
{code}
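To put numbers on the two excerpts above, the gap between the "starting execution" and "started execution" lines can be computed from the glog timestamps. The helper below is only an illustrative sketch; the function names GlogTimeToMicros and ElapsedMicros are made up for this note and are not part of Impala. It assumes both timestamps fall on the same day and carry exactly six fractional digits, as glog prints them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: converts the "HH:MM:SS.uuuuuu" field of a glog line
// prefix into microseconds since midnight. Returns -1 if the field does not
// parse. Assumes exactly six fractional digits.
int64_t GlogTimeToMicros(const std::string& t) {
  int h = 0, m = 0, s = 0, us = 0;
  if (std::sscanf(t.c_str(), "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4) {
    return -1;
  }
  return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Hypothetical helper: elapsed microseconds between two same-day timestamps.
int64_t ElapsedMicros(const std::string& start, const std::string& end) {
  const int64_t a = GlogTimeToMicros(start);
  const int64_t b = GlogTimeToMicros(end);
  assert(a >= 0 && b >= a);  // Both must parse, and end must not precede start.
  return b - a;
}
```

Applied to the logs above, ElapsedMicros("21:25:30.813993", "21:29:58.406466") gives 267592473 microseconds (about 4 minutes 28 seconds) for the 644-backend query, and ElapsedMicros("21:23:48.270226", "21:29:58.402195") gives 370131969 microseconds (about 6 minutes 10 seconds) for the 640-backend query. In both cases Cancel() is logged within a few milliseconds of "started execution".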
I checked the coordinator, and its threads appear to be spending a lot of time
waiting on exec_complete_barrier_:
{code}
#0 0x00007fd928c816d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000001222944 in impala::Promise<bool>::Get() ()
#2 0x0000000001220d7b in impala::Coordinator::StartBackendExec() ()
#3 0x0000000001221c87 in impala::Coordinator::Exec() ()
#4 0x0000000000c3a925 in impala::ClientRequestState::ExecQueryOrDmlRequest(impala::TQueryExecRequest const&) ()
#5 0x0000000000c41f7e in impala::ClientRequestState::Exec(impala::TExecRequest*) ()
#6 0x0000000000bff597 in impala::ImpalaServer::ExecuteInternal(impala::TQueryCtx const&, std::shared_ptr<impala::ImpalaServer::SessionState>, bool*, std::shared_ptr<impala::ClientRequestState>*) ()
#7 0x0000000000c061d9 in impala::ImpalaServer::Execute(impala::TQueryCtx*, std::shared_ptr<impala::ImpalaServer::SessionState>, std::shared_ptr<impala::ClientRequestState>*) ()
#8 0x0000000000c561c5 in impala::ImpalaServer::query(beeswax::QueryHandle&, beeswax::Query const&) ()
#11 0x0000000000d60c9a in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
{code}
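One possible direction, following the "possibly a timeout?" idea, is a startup wait that wakes on the first failed start RPC or on a deadline instead of blocking until every backend has reported. The sketch below is not Impala code and says nothing about how exec_complete_barrier_ is actually implemented; the class FailFastBarrier, the enum StartupResult, and the timeout parameter are all hypothetical names used for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical outcome of waiting for all backends to start.
enum class StartupResult { kAllStarted, kBackendFailed, kTimedOut };

// Hypothetical barrier: counts down as start RPCs complete, but releases the
// waiter as soon as any single RPC reports failure or the timeout expires.
class FailFastBarrier {
 public:
  explicit FailFastBarrier(int num_backends) : pending_(num_backends) {
    assert(num_backends >= 0);
  }

  // Called once per backend when its start RPC completes; ok is false if
  // the RPC failed.
  void Notify(bool ok) {
    std::lock_guard<std::mutex> l(mu_);
    if (!ok) failed_ = true;
    if (pending_ > 0) --pending_;
    cv_.notify_all();
  }

  // Blocks until all backends have reported, one has failed, or the timeout
  // expires, whichever comes first. A failure takes precedence over the
  // other two outcomes.
  StartupResult Wait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(mu_);
    const bool done = cv_.wait_for(
        l, timeout, [this] { return failed_ || pending_ == 0; });
    if (failed_) return StartupResult::kBackendFailed;
    return done ? StartupResult::kAllStarted : StartupResult::kTimedOut;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_;
  bool failed_ = false;
};
```

With a waiter of this shape, a coordinator thread could begin cancellation when the first failure is reported rather than several minutes later, once the remaining backends have been contacted. Whether that fits the asynchronous fragment startup path here is an open question.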