Todd Lipcon created KUDU-2193:
---------------------------------

             Summary: Severe lock contention on TSTabletManager lock
                 Key: KUDU-2193
                 URL: https://issues.apache.org/jira/browse/KUDU-2193
             Project: Kudu
          Issue Type: Bug
          Components: tserver
    Affects Versions: 1.6.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical


I'm doing some stress/failure testing on a cluster with lots of tablets and ran 
into the following mess:

- TSTabletManager::GenerateIncrementalTabletReport is holding the 
TSTabletManager lock in 'read' mode
-- it's calling CreateReportedTabletPB on a bunch of tablets which are in the 
process of an election storm
-- each such call blocks in RaftConsensus::ConsensusState since that tablet's 
RaftConsensus is in the process of fsyncing metadata to disk
-- thus the report thread holds the TSTabletManager lock in 'read' mode for a 
long time (many seconds if not tens of seconds)
- meanwhile, some other thread is trying to take TSTabletManager::lock for 
write, and is blocked by the above reader
- rw_spinlock is writer-starvation-free, which means that once a writer is 
waiting, no more readers can acquire the lock

What's worse is that rw_spinlock is a true spin lock, so now there are tens of 
threads in a 'while (true) sched_yield()' loop, generating over 1.5M context 
switches per second.
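
To make the sequence above concrete, here is a minimal sketch of a 
writer-preferring reader/writer spin lock. This is NOT Kudu's rw_spinlock 
implementation; the class ToyRwSpinLock and all of its method names are 
hypothetical and exist only to illustrate the behavior described: one slow 
reader plus one waiting writer is enough to shut out every new reader, and 
every blocked thread sits in a sched_yield() loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <sched.h>

// Toy writer-preferring reader/writer spin lock (hypothetical, for
// illustration only; not Kudu's rw_spinlock).
// Bit 31 of state_ means "a writer holds or is waiting for the lock";
// the low 31 bits count the active readers.
class ToyRwSpinLock {
 public:
  // Fails as soon as a writer has announced itself, even if that writer
  // is itself still waiting for older readers to leave.
  bool try_lock_shared() {
    std::uint32_t cur = state_.load(std::memory_order_relaxed);
    while ((cur & kWriterBit) == 0) {
      // On failure compare_exchange_weak reloads 'cur', so we re-check.
      if (state_.compare_exchange_weak(cur, cur + 1,
                                       std::memory_order_acquire)) {
        return true;
      }
    }
    return false;
  }

  // Blocked readers burn CPU here: each failed attempt is a sched_yield().
  void lock_shared() {
    while (!try_lock_shared()) sched_yield();
  }

  void unlock_shared() {
    std::uint32_t prev = state_.fetch_sub(1, std::memory_order_release);
    assert((prev & ~kWriterBit) != 0);  // there must have been a reader
    (void)prev;
  }

  // First half of taking the write lock: claim the writer bit.
  // From this point on, no new reader can get in.
  void announce_writer() {
    for (;;) {
      std::uint32_t cur = state_.load(std::memory_order_relaxed);
      if ((cur & kWriterBit) == 0 &&
          state_.compare_exchange_weak(cur, cur | kWriterBit,
                                       std::memory_order_acquire)) {
        return;
      }
      sched_yield();
    }
  }

  bool readers_drained() const {
    return (state_.load(std::memory_order_acquire) & ~kWriterBit) == 0;
  }

  // Full write lock: announce, then spin until existing readers are gone.
  void lock() {
    announce_writer();
    while (!readers_drained()) sched_yield();
  }

  void unlock() {
    state_.fetch_and(~kWriterBit, std::memory_order_release);
  }

 private:
  static constexpr std::uint32_t kWriterBit = 1u << 31;
  std::atomic<std::uint32_t> state_{0};
};
```

With a lock like this, if one thread holds the lock in shared mode for tens 
of seconds and a second thread calls lock(), the writer spins waiting for the 
reader to leave, and every thread that then calls lock_shared() spins behind 
the writer. None of them sleep; they all just yield in a loop until the 
original reader finishes.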



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
