[ 
https://issues.apache.org/jira/browse/KUDU-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reopened KUDU-1863:
---------------------------------

This issue is not fixed yet.  The same deadlock happened again just recently.  
That was a 3-master configuration, and the leader master was shutting down when 
it received a TS heartbeat and started processing it, issuing a write operation 
to the system catalog table:

{noformat}
Thread 403 (Thread 0x7f45d1107700 (LWP 9485)):                                  
#0  pthread_cond_wait@@GLIBC_2.3.2 () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:
#1  0x00007f45de9d2bad in kudu::ConditionVariable::Wait (this=0x7f45d1105918) 
at /data/jenkins-works
#2  0x0000000000429da1 in kudu::CountDownLatch::Wait (this=0x7f45d11058e0) at 
/data/jenkins-workspac
#3  0x00007f45e8324413 in kudu::master::SysCatalogTable::SyncWrite 
(this=0x15c6400, req=0x7f45d1105a
#4  0x00007f45e83250ce in kudu::master::SysCatalogTable::Write (this=0x15c6400, 
actions=...) at /dat
#5  0x00007f45e82b665d in kudu::master::CatalogManager::HandleReportedTablet 
(this=0x3582a00, ts_des
#6  0x00007f45e82b43b7 in kudu::master::CatalogManager::ProcessTabletReport 
(this=0x3582a00, ts_desc
#7  0x00007f45e8309af9 in kudu::master::MasterServiceImpl::TSHeartbeat 
(this=0x1e81130, req=0x24620f
#8  0x00007f45e312e438 in 
kudu::master::MasterServiceIf::MasterServiceIf(scoped_refptr<kudu::MetricE
#9  0x00007f45e31337d7 in std::_Function_handler<void(const 
google::protobuf::Message*, google::prot
#10 0x00007f45e12ddcfa in std::function<void (google::protobuf::Message const*, 
google::protobuf::Me
#11 0x00007f45e12dd495 in kudu::rpc::GeneratedServiceIf::Handle 
(this=0x1e81130, call=0x14332c0) at
#12 0x00007f45e12e017b in kudu::rpc::ServicePool::RunThread (this=0x17c4400) at 
/data/jenkins-worksp
#13 0x00007f45e12e15db in boost::_mfi::mf0<void, 
kudu::rpc::ServicePool>::operator() (this=0x1ec73e0
#14 0x00007f45e12e1402 in 
boost::_bi::list1<boost::_bi::value<kudu::rpc::ServicePool*> >::operator()
#15 0x00007f45e12e12f7 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, 
kudu::rpc::ServicePool>, b
#16 0x00007f45e12e122e in 
boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<voi
#17 0x00007f45e1299ee6 in boost::function0<void>::operator() (this=0x1ec73d8) 
at /data/jenkins-works
#18 0x00007f45deab9229 in kudu::Thread::SuperviseThread (arg=0x1ec73b0) at 
/data/jenkins-workspace/k
#19 0x00007f45e3c59184 in start_thread (arg=0x7f45d1107700) at 
pthread_create.c:312
#20 0x00007f45db91337d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

The follower masters sent replies to the leader master, acknowledging the write 
operation, but two reactor threads are stuck in 
{{kudu::rpc::Messenger::QueueInboundCall}}:
{noformat}
Thread 407 (Thread 0x7f45d993f700 (LWP 9472)):                                  
#0  0x00007f45e3c60b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 
#1  0x00007f45e82c60fc in boost::detail::yield (k=27778558) at 
/data/jenkins-workspace/kudu-test-cdh
#2  0x00007f45e82c662d in kudu::rw_semaphore::lock_shared (this=0x28c5908) at 
/data/jenkins-workspac
#3  0x00007f45e82c69a4 in kudu::rw_spinlock::lock_shared (this=0x28c5908) at 
/data/jenkins-workspace
#4  0x00007f45e82d4f68 in kudu::shared_lock<kudu::rw_spinlock>::shared_lock 
(this=0x7f45d993e4c0, m=
#5  0x00007f45e12870e0 in kudu::rpc::Messenger::QueueInboundCall 
(this=0x15605a0, call=...) at /data
#6  0x00007f45e1272c00 in kudu::rpc::Connection::HandleIncomingCall 
(this=0x32ede00, transfer=...) a
#7  0x00007f45e127242c in kudu::rpc::Connection::ReadHandler (this=0x32ede00, 
watcher=..., revents=1
#8  0x00007f45e12782bc in ev::base<ev_io, 
ev::io>::method_thunk<kudu::rpc::Connection, &kudu::rpc::C
#9  0x00007f45dd5764fd in ev_invoke_pending (loop=0x22ddb00) at 
/data/jenkins-workspace/kudu-test-cd
#10 0x00007f45dd577403 in ev_run (loop=0x22ddb00, flags=0) at 
/data/jenkins-workspace/kudu-test-cdh5
#11 0x00007f45e12a32f5 in ev::loop_ref::run (this=0x383d548, flags=0) at 
/data/jenkins-workspace/kud
#12 0x00007f45e129fce0 in kudu::rpc::ReactorThread::RunThread (this=0x383d540) 
at /data/jenkins-work
#13 0x00007f45e12ab14d in boost::_mfi::mf0<void, 
kudu::rpc::ReactorThread>::operator() (this=0x2cdbb
#14 0x00007f45e12aa9e8 in 
boost::_bi::list1<boost::_bi::value<kudu::rpc::ReactorThread*> >::operator
#15 0x00007f45e12aa153 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, 
kudu::rpc::ReactorThread>,
#16 0x00007f45e12a94e6 in 
boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<voi
#17 0x00007f45e1299ee6 in boost::function0<void>::operator() (this=0x2cdbb58) 
at /data/jenkins-works
#18 0x00007f45deab9229 in kudu::Thread::SuperviseThread (arg=0x2cdbb30) at 
/data/jenkins-workspace/k
#19 0x00007f45e3c59184 in start_thread (arg=0x7f45d993f700) at 
pthread_create.c:312
#20 0x00007f45db91337d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
                                                                                
Thread 405 (Thread 0x7f45d893d700 (LWP 9474)):                                  
#0  0x00007f45e3c60b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 
#1  0x00007f45e82c60fc in boost::detail::yield (k=27783826) at 
/data/jenkins-workspace/kudu-test-cdh
#2  0x00007f45e82c662d in kudu::rw_semaphore::lock_shared (this=0x28c5948) at 
/data/jenkins-workspac
#3  0x00007f45e82c69a4 in kudu::rw_spinlock::lock_shared (this=0x28c5948) at 
/data/jenkins-workspace
#4  0x00007f45e82d4f68 in kudu::shared_lock<kudu::rw_spinlock>::shared_lock 
(this=0x7f45d893c4c0, m=
#5  0x00007f45e12870e0 in kudu::rpc::Messenger::QueueInboundCall 
(this=0x15605a0, call=...) at /data
#6  0x00007f45e1272c00 in kudu::rpc::Connection::HandleIncomingCall 
(this=0x357a3c0, transfer=...) a
#7  0x00007f45e127242c in kudu::rpc::Connection::ReadHandler (this=0x357a3c0, 
watcher=..., revents=1
#8  0x00007f45e12782bc in ev::base<ev_io, 
ev::io>::method_thunk<kudu::rpc::Connection, &kudu::rpc::C
#9  0x00007f45dd5764fd in ev_invoke_pending (loop=0x22ded00) at 
/data/jenkins-workspace/kudu-test-cd
#10 0x00007f45dd577403 in ev_run (loop=0x22ded00, flags=0) at 
/data/jenkins-workspace/kudu-test-cdh5
#11 0x00007f45e12a32f5 in ev::loop_ref::run (this=0x383de48, flags=0) at 
/data/jenkins-workspace/kud
#12 0x00007f45e129fce0 in kudu::rpc::ReactorThread::RunThread (this=0x383de40) 
at /data/jenkins-work
#13 0x00007f45e12ab14d in boost::_mfi::mf0<void, 
kudu::rpc::ReactorThread>::operator() (this=0x1c3f5
#14 0x00007f45e12aa9e8 in 
boost::_bi::list1<boost::_bi::value<kudu::rpc::ReactorThread*> >::operator
#15 0x00007f45e12aa153 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, 
kudu::rpc::ReactorThread>,
#16 0x00007f45e12a94e6 in 
boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<voi
#17 0x00007f45e1299ee6 in boost::function0<void>::operator() (this=0x1c3f5b8) 
at /data/jenkins-works
#18 0x00007f45deab9229 in kudu::Thread::SuperviseThread (arg=0x1c3f590) at 
/data/jenkins-workspace/k
#19 0x00007f45e3c59184 in start_thread (arg=0x7f45d893d700) at 
pthread_create.c:312
#20 0x00007f45db91337d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

So, the first thread (Thread 403) is stuck waiting for the response, 
which would be processed if the 2nd (Thread 407) and 3rd (Thread 405) threads 
were not stuck (actually, it would be enough to have at least one of them 
unblocked to get the majority for the operation acknowledgment).  The 2nd and 
3rd threads are blocked because the 4th thread (Thread 1) runs the shutdown and 
waits for the ServicePool to shut down:
{noformat}
Thread 1 (Thread 0x7f45e9899880 (LWP 4807)):                                    
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_ti
#1  0x00007f45de9d2e30 in kudu::ConditionVariable::TimedWait (this=0x1ec7430, 
max_time=...) at /data
#2  0x0000000000429e20 in kudu::CountDownLatch::WaitFor (this=0x1ec73f8, 
delta=...) at /data/jenkins
#3  0x00007f45deab8070 in kudu::ThreadJoiner::Join (this=0x7fff0b292460) at 
/data/jenkins-workspace/
#4  0x00007f45e12defbf in kudu::rpc::ServicePool::Shutdown (this=0x17c4400) at 
/data/jenkins-workspa
#5  0x00007f45e12deab2 in kudu::rpc::ServicePool::~ServicePool (this=0x17c4400, 
__in_chrg=<optimized
#6  0x00007f45e12deb66 in kudu::rpc::ServicePool::~ServicePool (this=0x17c4400, 
__in_chrg=<optimized
#7  0x00007f45e63c8010 in kudu::RefCountedThreadSafe<kudu::rpc::RpcService, 
kudu::DefaultRefCountedT
#8  0x00007f45e63c78be in 
kudu::DefaultRefCountedThreadSafeTraits<kudu::rpc::RpcService>::Destruct (
#9  0x00007f45e63c6e10 in kudu::RefCountedThreadSafe<kudu::rpc::RpcService, 
kudu::DefaultRefCountedT
#10 0x00007f45e63c6395 in scoped_refptr<kudu::rpc::RpcService>::~scoped_refptr 
(this=0x1681f10, __in
#11 0x00007f45e128aade in std::pair<std::string const, 
scoped_refptr<kudu::rpc::RpcService> >::~pair
#12 0x00007f45e1290336 in __gnu_cxx::new_allocator<std::pair<std::string const, 
scoped_refptr<kudu::
#13 0x00007f45e128fa71 in 
std::allocator_traits<std::allocator<std::pair<std::string const, scoped_r
#14 0x00007f45e128ee93 in 
std::allocator_traits<std::allocator<std::pair<std::string const, scoped_r
#15 0x00007f45e128df61 in 
std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<s
#16 0x00007f45e128cd13 in 
std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<s
#17 0x00007f45e128bac0 in std::_Hashtable<std::string, std::pair<std::string 
const, scoped_refptr<ku
#18 0x00007f45e128a5ce in std::unordered_map<std::string, 
scoped_refptr<kudu::rpc::RpcService>, std:
#19 0x00007f45e1286e51 in kudu::rpc::Messenger::UnregisterAllServices 
(this=0x15605a0) at /data/jenk
#20 0x00007f45e63c50cc in kudu::RpcServer::Shutdown (this=0x2199d40) at 
/data/jenkins-workspace/kudu
#21 0x00007f45e63cdd07 in kudu::server::ServerBase::Shutdown (this=0x1f13e00) 
at /data/jenkins-works
#22 0x00007f45e8300fd8 in kudu::master::Master::Shutdown (this=0x1f13e00) at 
/data/jenkins-workspace
#23 0x00007f45e831dab0 in kudu::master::MiniMaster::Shutdown (this=0x32af680) 
at /data/jenkins-works
#24 0x0000000000428225 in 
kudu::tools::RemoteKsckTest_TestLeaderMasterDown_Test::TestBody (this=0x14
#25 0x00007f45dc851af8 in HandleSehExceptionsInMethodIfSupported<testing::Test, 
void> (location=0x7f
#26 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> 
(object=object@entry
#27 0x00007f45dc845fc2 in testing::Test::Run (this=0x1432000) at 
/data/jenkins-workspace/kudu-test-c
#28 0x00007f45dc846108 in testing::TestInfo::Run (this=0x13a86e0) at 
/data/jenkins-workspace/kudu-te
#29 0x00007f45dc8461e5 in testing::TestCase::Run (this=0x13f6000) at 
/data/jenkins-workspace/kudu-te
#30 0x00007f45dc8464c8 in testing::internal::UnitTestImpl::RunAllTests 
(this=0x13c8200) at /data/jen
#31 0x00007f45dc852008 in 
HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bo
#32 
testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl,
 bool> (o
#33 0x00007f45dc8467ad in testing::UnitTest::Run (this=0x7f45dca772c0 
<testing::UnitTest::GetInstanc
#34 0x00007f45e8f51e55 in RUN_ALL_TESTS () at 
/data/jenkins-workspace/kudu-test-cdh5.12/thirdparty/i
#35 0x00007f45e8f4fc80 in main (argc=1, argv=0x7fff0b292e18) at 
/data/jenkins-workspace/kudu-test-cd
#36 0x00007f45db83af45 in __libc_start_main (main=0x422b80 <main@plt>, argc=3, 
argv=0x7fff0b292e08,
#37 0x0000000000423529 in _start ()
{noformat}

The 4th thread (Thread 1) is blocked because it waits for the first thread 
(Thread 403) to complete its task and join.
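
The dependencies described above form a cycle, which can be checked with a 
small wait-for graph.  This is only an illustrative sketch: the names 
{{WaitForGraph}}, {{HasDeadlock}} and {{ShutdownWaitForGraph}} are hypothetical 
and not part of Kudu, and the edge from the reactor threads to Thread 1 assumes, 
as the analysis suggests, that the shutdown path holds the lock that 
{{QueueInboundCall}} tries to acquire in shared mode.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Wait-for graph: an edge A -> B means "A cannot proceed until B does".
using WaitForGraph = std::map<std::string, std::vector<std::string>>;

// Depth-first search; returns true if a cycle is reachable from `node`.
// `on_path` holds the nodes on the current DFS path, `done` holds nodes
// that were fully explored without finding a cycle.
bool HasCycleFrom(const WaitForGraph& g, const std::string& node,
                  std::set<std::string>& on_path,
                  std::set<std::string>& done) {
  if (on_path.count(node)) return true;  // back edge: cycle found
  if (done.count(node)) return false;
  on_path.insert(node);
  auto it = g.find(node);
  if (it != g.end()) {
    for (const auto& next : it->second) {
      if (HasCycleFrom(g, next, on_path, done)) return true;
    }
  }
  on_path.erase(node);
  done.insert(node);
  return false;
}

bool HasDeadlock(const WaitForGraph& g) {
  std::set<std::string> on_path, done;
  for (const auto& entry : g) {
    if (HasCycleFrom(g, entry.first, on_path, done)) return true;
  }
  return false;
}

// The dependencies from the stack traces, as interpreted in the analysis.
WaitForGraph ShutdownWaitForGraph() {
  return {
      // SyncWrite waits for an ack that a reactor thread must process.
      {"Thread 403", {"Thread 407", "Thread 405"}},
      // Reactor threads spin in lock_shared() inside QueueInboundCall.
      {"Thread 407", {"Thread 1"}},
      {"Thread 405", {"Thread 1"}},
      // ServicePool::Shutdown joins the service thread.
      {"Thread 1", {"Thread 403"}},
  };
}
```

{{HasDeadlock(ShutdownWaitForGraph())}} reports a cycle.  Thread 403 needs only 
one of the two reactor threads to make progress, but since both of them wait on 
Thread 1, every path leads back to Thread 403.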

This deadlock happens rarely because it requires 'perfect timing' between 
sending out the write request and the call to 
{{kudu::rpc::Messenger::UnregisterAllServices}} initiated by 
{{kudu::master::Master::Shutdown}}.
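
One common way to avoid this class of deadlock is to detach the registered 
services while holding the exclusive lock, and to run their destructors (and 
the thread joins they perform) only after the lock has been released.  The 
sketch below is a toy model of that idea using only the C++17 standard library.  
It is not Kudu code and not necessarily the fix that should be adopted here: 
{{ToyService}}, {{ToyRegistry}} and {{RunShutdownScenario}} are hypothetical 
names, and {{std::shared_mutex}} stands in for {{kudu::rw_spinlock}}.

```cpp
#include <cassert>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a service pool: its worker waits for an
// acknowledgment (as Thread 403 does) and the destructor joins it.
class ToyService {
 public:
  explicit ToyService(std::shared_future<void> ack)
      : worker_([ack] { ack.wait(); }) {}
  ~ToyService() { worker_.join(); }

 private:
  std::thread worker_;
};

class ToyRegistry {
 public:
  void Register(const std::string& name, std::shared_future<void> ack) {
    std::unique_lock<std::shared_mutex> l(lock_);
    services_[name] = std::make_unique<ToyService>(std::move(ack));
  }

  // Reactor-side path (compare QueueInboundCall): takes the lock in shared
  // mode to look up the target service.
  bool QueueInbound(const std::string& name) const {
    std::shared_lock<std::shared_mutex> l(lock_);
    return services_.count(name) > 0;
  }

  // Shutdown path: swap the map out under the exclusive lock, then let the
  // destructors join their workers after the lock has been released.
  void UnregisterAll() {
    std::map<std::string, std::unique_ptr<ToyService>> doomed;
    {
      std::unique_lock<std::shared_mutex> l(lock_);
      doomed.swap(services_);
    }
    // `doomed` is destroyed on return, with `lock_` no longer held.
  }

  bool Empty() const {
    std::shared_lock<std::shared_mutex> l(lock_);
    return services_.empty();
  }

 private:
  mutable std::shared_mutex lock_;
  std::map<std::string, std::unique_ptr<ToyService>> services_;
};

// Replays the scenario: the reactor must get through an inbound call
// before it can deliver the ack the worker is waiting for.
bool RunShutdownScenario() {
  std::promise<void> ack;
  ToyRegistry registry;
  registry.Register("master", ack.get_future().share());
  std::thread reactor([&] {
    registry.QueueInbound("master");  // needs the shared lock
    ack.set_value();                  // only then is the ack delivered
  });
  registry.UnregisterAll();  // joins the worker without holding the lock
  reactor.join();
  return registry.Empty();
}
```

If {{UnregisterAll}} instead cleared the map while still holding the exclusive 
lock, the reactor could block in {{QueueInbound}}, the ack would never be 
delivered, and the join in {{~ToyService}} would never return, which mirrors 
the cycle in the stack traces above.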

> Deadlock while shutting down master
> -----------------------------------
>
>                 Key: KUDU-1863
>                 URL: https://issues.apache.org/jira/browse/KUDU-1863
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>             Fix For: Backlog
>
>
> I saw ksck_remote-test fail because the master was hanging during its attempt 
> to shut down. Analysis follows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
