[ https://issues.apache.org/jira/browse/KUDU-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055909#comment-16055909 ]

Alexey Serbin edited comment on KUDU-1863 at 6/20/17 3:16 PM:
--------------------------------------------------------------

This issue is not fixed yet: the same deadlock happened just recently.  It was a 3-master configuration, and the leader master was shutting down when it received a TS heartbeat and started processing it, issuing a write operation to the system catalog table:

{noformat}
Thread 403 (Thread 0x7f45d1107700 (LWP 9485)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:
#1  0x00007f45de9d2bad in kudu::ConditionVariable::Wait (this=0x7f45d1105918) at /data/jenkins-works
#2  0x0000000000429da1 in kudu::CountDownLatch::Wait (this=0x7f45d11058e0) at /data/jenkins-workspac
#3  0x00007f45e8324413 in kudu::master::SysCatalogTable::SyncWrite (this=0x15c6400, req=0x7f45d1105a
#4  0x00007f45e83250ce in kudu::master::SysCatalogTable::Write (this=0x15c6400, actions=...) at /dat
#5  0x00007f45e82b665d in kudu::master::CatalogManager::HandleReportedTablet (this=0x3582a00, ts_des
#6  0x00007f45e82b43b7 in kudu::master::CatalogManager::ProcessTabletReport (this=0x3582a00, ts_desc
#7  0x00007f45e8309af9 in kudu::master::MasterServiceImpl::TSHeartbeat (this=0x1e81130, req=0x24620f
#8  0x00007f45e312e438 in kudu::master::MasterServiceIf::MasterServiceIf(scoped_refptr<kudu::MetricE
#9  0x00007f45e31337d7 in std::_Function_handler<void(const google::protobuf::Message*, google::prot
#10 0x00007f45e12ddcfa in std::function<void (google::protobuf::Message const*, google::protobuf::Me
#11 0x00007f45e12dd495 in kudu::rpc::GeneratedServiceIf::Handle (this=0x1e81130, call=0x14332c0) at
#12 0x00007f45e12e017b in kudu::rpc::ServicePool::RunThread (this=0x17c4400) at /data/jenkins-worksp
{noformat}

The follower masters sent replies to the leader master, acknowledging the write operation, and two threads got stuck in {{kudu::rpc::Messenger::QueueInboundCall}} while attempting to process those responses:
{noformat}
Thread 407 (Thread 0x7f45d993f700 (LWP 9472)):
#0  0x00007f45e3c60b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f45e82c60fc in boost::detail::yield (k=27778558) at /data/jenkins-workspace/kudu-test-cdh
#2  0x00007f45e82c662d in kudu::rw_semaphore::lock_shared (this=0x28c5908) at /data/jenkins-workspac
#3  0x00007f45e82c69a4 in kudu::rw_spinlock::lock_shared (this=0x28c5908) at /data/jenkins-workspace
#4  0x00007f45e82d4f68 in kudu::shared_lock<kudu::rw_spinlock>::shared_lock (this=0x7f45d993e4c0, m=
#5  0x00007f45e12870e0 in kudu::rpc::Messenger::QueueInboundCall (this=0x15605a0, call=...) at /data
#6  0x00007f45e1272c00 in kudu::rpc::Connection::HandleIncomingCall (this=0x32ede00, transfer=...) a
#7  0x00007f45e127242c in kudu::rpc::Connection::ReadHandler (this=0x32ede00, watcher=..., revents=1
#8  0x00007f45e12782bc in ev::base<ev_io, ev::io>::method_thunk<kudu::rpc::Connection, &kudu::rpc::C
#9  0x00007f45dd5764fd in ev_invoke_pending (loop=0x22ddb00) at /data/jenkins-workspace/kudu-test-cd
#10 0x00007f45dd577403 in ev_run (loop=0x22ddb00, flags=0) at /data/jenkins-workspace/kudu-test-cdh5
#11 0x00007f45e12a32f5 in ev::loop_ref::run (this=0x383d548, flags=0) at /data/jenkins-workspace/kud

Thread 405 (Thread 0x7f45d893d700 (LWP 9474)):
#0  0x00007f45e3c60b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f45e82c60fc in boost::detail::yield (k=27783826) at /data/jenkins-workspace/kudu-test-cdh
#2  0x00007f45e82c662d in kudu::rw_semaphore::lock_shared (this=0x28c5948) at /data/jenkins-workspac
#3  0x00007f45e82c69a4 in kudu::rw_spinlock::lock_shared (this=0x28c5948) at /data/jenkins-workspace
#4  0x00007f45e82d4f68 in kudu::shared_lock<kudu::rw_spinlock>::shared_lock (this=0x7f45d893c4c0, m=
#5  0x00007f45e12870e0 in kudu::rpc::Messenger::QueueInboundCall (this=0x15605a0, call=...) at /data
#6  0x00007f45e1272c00 in kudu::rpc::Connection::HandleIncomingCall (this=0x357a3c0, transfer=...) a
#7  0x00007f45e127242c in kudu::rpc::Connection::ReadHandler (this=0x357a3c0, watcher=..., revents=1
#8  0x00007f45e12782bc in ev::base<ev_io, ev::io>::method_thunk<kudu::rpc::Connection, &kudu::rpc::C
#9  0x00007f45dd5764fd in ev_invoke_pending (loop=0x22ded00) at /data/jenkins-workspace/kudu-test-cd
#10 0x00007f45dd577403 in ev_run (loop=0x22ded00, flags=0) at /data/jenkins-workspace/kudu-test-cdh5
#11 0x00007f45e12a32f5 in ev::loop_ref::run (this=0x383de48, flags=0) at /data/jenkins-workspace/kud
{noformat}

So, the first thread (Thread 403) is stuck waiting for the response, which would be processed if the 2nd (Thread 407) and 3rd (Thread 405) threads were not stuck (actually, it would be enough to have at least one of them unblocked to get the majority for the operation acknowledgment).  The 2nd and 3rd threads are blocked because the 4th thread (Thread 1) is running the shutdown sequence and waiting for the ServicePool to shut down:
{noformat}
Thread 1 (Thread 0x7f45e9899880 (LWP 4807)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_ti
#1  0x00007f45de9d2e30 in kudu::ConditionVariable::TimedWait (this=0x1ec7430, max_time=...) at /data
#2  0x0000000000429e20 in kudu::CountDownLatch::WaitFor (this=0x1ec73f8, delta=...) at /data/jenkins
#3  0x00007f45deab8070 in kudu::ThreadJoiner::Join (this=0x7fff0b292460) at /data/jenkins-workspace/
#4  0x00007f45e12defbf in kudu::rpc::ServicePool::Shutdown (this=0x17c4400) at /data/jenkins-workspa
#5  0x00007f45e12deab2 in kudu::rpc::ServicePool::~ServicePool (this=0x17c4400, __in_chrg=<optimized
#6  0x00007f45e12deb66 in kudu::rpc::ServicePool::~ServicePool (this=0x17c4400, __in_chrg=<optimized
#7  0x00007f45e63c8010 in kudu::RefCountedThreadSafe<kudu::rpc::RpcService, kudu::DefaultRefCountedT
#8  0x00007f45e63c78be in kudu::DefaultRefCountedThreadSafeTraits<kudu::rpc::RpcService>::Destruct (
#9  0x00007f45e63c6e10 in kudu::RefCountedThreadSafe<kudu::rpc::RpcService, kudu::DefaultRefCountedT
#10 0x00007f45e63c6395 in scoped_refptr<kudu::rpc::RpcService>::~scoped_refptr (this=0x1681f10, __in
#11 0x00007f45e128aade in std::pair<std::string const, scoped_refptr<kudu::rpc::RpcService> >::~pair
#12 0x00007f45e1290336 in __gnu_cxx::new_allocator<std::pair<std::string const, scoped_refptr<kudu::
#13 0x00007f45e128fa71 in std::allocator_traits<std::allocator<std::pair<std::string const, scoped_r
#14 0x00007f45e128ee93 in std::allocator_traits<std::allocator<std::pair<std::string const, scoped_r
#15 0x00007f45e128df61 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<s
#16 0x00007f45e128cd13 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<s
#17 0x00007f45e128bac0 in std::_Hashtable<std::string, std::pair<std::string const, scoped_refptr<ku
#18 0x00007f45e128a5ce in std::unordered_map<std::string, scoped_refptr<kudu::rpc::RpcService>, std:
#19 0x00007f45e1286e51 in kudu::rpc::Messenger::UnregisterAllServices (this=0x15605a0) at /data/jenk
#20 0x00007f45e63c50cc in kudu::RpcServer::Shutdown (this=0x2199d40) at /data/jenkins-workspace/kudu
#21 0x00007f45e63cdd07 in kudu::server::ServerBase::Shutdown (this=0x1f13e00) at /data/jenkins-works
#22 0x00007f45e8300fd8 in kudu::master::Master::Shutdown (this=0x1f13e00) at /data/jenkins-workspace
{noformat}

The 4th thread (Thread 1) is blocked because it waits for the first thread (Thread 403) to complete its task and join.

This deadlock happens rarely because it depends on 'perfect' timing between sending out the request and the call to {{kudu::rpc::Messenger::UnregisterAllServices}} initiated by {{kudu::master::Master::Shutdown}}.



> Deadlock while shutting down master
> -----------------------------------
>
>                 Key: KUDU-1863
>                 URL: https://issues.apache.org/jira/browse/KUDU-1863
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>             Fix For: Backlog
>
>         Attachments: ksck_remote-test.txt.bz2
>
>
> I saw ksck_remote-test fail because the master was hanging during its attempt 
> to shut down. Analysis follows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
