Viraj Jasani created HBASE-28271:
------------------------------------
Summary: Infinite waiting on lock acquisition by snapshot can
result in unresponsive master
Key: HBASE-28271
URL: https://issues.apache.org/jira/browse/HBASE-28271
Project: HBase
Issue Type: Improvement
Affects Versions: 2.5.7, 2.4.17, 3.0.0-alpha-4
Reporter: Viraj Jasani
Assignee: Viraj Jasani
Attachments: image.png
When a region is stuck in transition for a significant time, any attempt to take
a snapshot of the table keeps a master handler thread waiting forever. When a
snapshot is created on an enabled or disabled table, the snapshot handler
executes a LockProcedure to acquire the table-level lock. If any region of the
table is in transition, that LockProcedure cannot proceed, so the snapshot
handler waits indefinitely. It is released only when the region transition
completes and the table-level lock can finally be acquired.
If a region stays in RIT (region in transition) for a considerable time and the
client makes enough snapshot attempts on the table, those attempts can exhaust
all handler threads and potentially leave the master unresponsive. A sample
thread dump is attached.
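The exhaustion effect can be reproduced outside HBase with a small simulation. This is a sketch under stated assumptions, not HBase code: a java.util.concurrent ReentrantLock stands in for the table-level lock, a fixed thread pool stands in for the master handler threads, and the class and method names (HandlerExhaustionDemo, unrelatedRequestServed) are hypothetical. Each "snapshot" task waits on the lock with no bound, so once every handler holds such a task, even an unrelated request cannot be served.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class HandlerExhaustionDemo {

    /**
     * Hypothetical simulation: returns true if an unrelated request is still
     * served while `snapshotRequests` handlers wait unboundedly on a held lock.
     */
    public static boolean unrelatedRequestServed(int handlers, int snapshotRequests)
            throws InterruptedException {
        // Stand-in for the table-level lock that cannot be granted during RIT.
        ReentrantLock tableLock = new ReentrantLock();
        // Stand-in for the fixed pool of master handler threads.
        ExecutorService handlerPool = Executors.newFixedThreadPool(handlers);
        tableLock.lock(); // simulate the region in transition blocking the lock
        try {
            for (int i = 0; i < snapshotRequests; i++) {
                handlerPool.submit(() -> {
                    tableLock.lock(); // unbounded wait, like the snapshot handler
                    try {
                        // taking the snapshot itself is omitted
                    } finally {
                        tableLock.unlock();
                    }
                });
            }
            // Any other request now needs a free handler thread.
            Future<String> other = handlerPool.submit(() -> "ok");
            try {
                return "ok".equals(other.get(500, TimeUnit.MILLISECONDS));
            } catch (TimeoutException | ExecutionException e) {
                return false; // no handler was free to serve it in time
            }
        } finally {
            tableLock.unlock(); // simulate the region transition completing
            handlerPool.shutdown();
            handlerPool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 4 handlers, 2 stuck snapshot requests: spare handlers remain.
        System.out.println(unrelatedRequestServed(4, 2));
        // 4 handlers, 4 stuck snapshot requests: every handler is blocked.
        System.out.println(unrelatedRequestServed(4, 4));
    }
}
```

With four handlers and four blocked snapshot requests, the unrelated request sits in the queue until the lock is released, which mirrors the unresponsive-master symptom described above.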
Proposal: The snapshot handler should not stay stuck forever if it cannot take
the table-level lock; it should fail fast.
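One way to picture "fail fast" is a bounded lock wait that returns an error to the caller instead of pinning a handler thread. The sketch below is an illustration only and assumes details the proposal does not specify: ReentrantLock.tryLock(long, TimeUnit) stands in for acquiring the table lock through LockProcedure, and the timeout value, the LockTimeoutException type, and all class and method names are hypothetical rather than part of HBase.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class FailFastSnapshotSketch {

    /** Hypothetical error returned to the client instead of blocking a handler. */
    public static class LockTimeoutException extends Exception {
        public LockTimeoutException(String msg) { super(msg); }
    }

    private final ReentrantLock tableLock; // stand-in for the table-level lock
    private final long lockTimeoutMs;      // hypothetical configurable bound

    public FailFastSnapshotSketch(ReentrantLock tableLock, long lockTimeoutMs) {
        this.tableLock = tableLock;
        this.lockTimeoutMs = lockTimeoutMs;
    }

    /** Waits at most lockTimeoutMs for the lock, then gives up and frees the handler. */
    public String snapshot(String table) throws LockTimeoutException, InterruptedException {
        if (!tableLock.tryLock(lockTimeoutMs, TimeUnit.MILLISECONDS)) {
            throw new LockTimeoutException("Could not acquire table lock for " + table
                    + " within " + lockTimeoutMs + " ms");
        }
        try {
            return "snapshot of " + table; // taking the snapshot itself is omitted
        } finally {
            tableLock.unlock();
        }
    }

    /** Returns true if snapshot() fails fast while another thread holds the lock. */
    public static boolean failsFastWhileHeld(long timeoutMs) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        FailFastSnapshotSketch handler = new FailFastSnapshotSketch(lock, timeoutMs);
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Simulated region in transition: holds the lock until told to release it.
        Thread rit = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        rit.start();
        held.await(); // make sure the lock is held before attempting the snapshot
        try {
            handler.snapshot("t1");
            return false;
        } catch (LockTimeoutException e) {
            return true;
        } finally {
            release.countDown();
            rit.join();
        }
    }

    public static void main(String[] args) throws Exception {
        FailFastSnapshotSketch handler = new FailFastSnapshotSketch(new ReentrantLock(), 100);
        System.out.println(handler.snapshot("t1")); // lock is free, so this succeeds
        System.out.println(failsFastWhileHeld(100)); // lock is held, so this fails fast
    }
}
```

The design choice is that the caller, not the handler thread, absorbs the cost of a long RIT: the client receives an error it can retry later, while the handler returns to the pool after at most the configured bound.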
!image.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)