Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8346 )
Change subject: KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a long time ...................................................................... KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a long time This changes tablet report generation to only hold the TSTabletManager lock long enough to copy a list of refs to the relevant tablets. Even though the lock is a reader-writer lock, a long read-side critical section can end up blocking other readers as long as any writer shows up. I saw the following in a cluster: - election storm ongoing - T1 generating a tablet report (thus holding the reader lock) - blocks for a long time getting ConsensusState from some tablets currently mid-fsync. - T2 handling a DeleteTablet or CreateTablet call (waiting for writer lock) - T3 through T20: blocking on reader lock in LookupTablet() The effect here is that all other threads end up blocked until T1 finishes its tablet report generation, which in this case can be tens of seconds due to all of the fsync traffic. These blocked threads then contribute to the ongoing election storm since they may delay timely responses to vote requests from other tablets. I tested this and the previous patch on a cluster that was previously experiencing the issue. I triggered some elections by kill -STOPping some servers and resuming them a few seconds later. Without these patches, other servers started doing lots of context switches (due to the spinlock used prior to this patch series) and I saw lots of blocked threads in pstack. With the patch, things seem cleaner. The following screenshot shows before/after voluntary_context_switch rate: https://i.imgur.com/7zcbImw.png (the missing data around 5pm is my restarting the servers with this patch) Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681 Reviewed-on: http://gerrit.cloudera.org:8080/8346 Tested-by: Kudu Jenkins Reviewed-by: Todd Lipcon <t...@apache.org> --- M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 2 files changed, 31 insertions(+), 18 deletions(-) Approvals: Kudu Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8346 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681 Gerrit-Change-Number: 8346 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <t...@apache.org>