[jira] [Resolved] (HBASE-13238) Time out locks and abort if HDFS is wedged

Andrew Kyle Purtell (Jira) Sat, 11 Jun 2022 12:41:26 -0700


     [ https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrew Kyle Purtell resolved HBASE-13238.
-----------------------------------------
    Resolution: Won't Fix

> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
>                 Key: HBASE-13238
>                 URL: https://issues.apache.org/jira/browse/HBASE-13238
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> This is a brainstorming issue on the topic of timing out locks and aborting 
> rather than waiting infinitely. Perhaps even as a rule.
> We had a minor production incident where a region was unable to close after 
> trying for 24 hours. The CloseRegionHandler was waiting for a write lock on 
> the ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding 
> read locks. Three other threads were stuck in scanning, all blocked on the 
> same DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the 
> third was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with 
> an apparently infinite timeout from PacketReceiver#readChannelFully.
> This is similar to other issues we have seen before, where a region wants to 
> finish a compaction before closing for a split but cannot, because some HDFS 
> issue has made the reader extremely slow if not wedged. This has led to what 
> should be quick SplitTransactions causing availability problems many minutes 
> long.
> The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning 
> to upgrade, but [~lhofhansl] and I were discussing the issue in general and 
> wondered whether we should be timing out locks such as the 
> ReentrantReadWriteLock and, when a timeout fires, aborting the regionserver. 
> In this case that would have caused recovery and reassignment of the region 
> in question, and we would not have had a prolonged availability problem.
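
For illustration only, the idea in the description could be sketched roughly as follows. This is not HBase code: TimedCloseSketch, AbortHook, beginScan, endScan, and the timeout value are hypothetical names invented for the example. The only real API used is java.util.concurrent's ReentrantReadWriteLock, whose write lock supports tryLock with a timeout. The sketch assumes that scanners hold the read lock and that close needs the write lock, as the description states.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a region; this is not HBase's HRegion.
class TimedCloseSketch {
    /** Hypothetical abort callback; a real regionserver has its own abort path. */
    interface AbortHook { void abort(String why); }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final AbortHook abortHook;
    private final long timeoutMs;

    TimedCloseSketch(AbortHook abortHook, long timeoutMs) {
        this.abortHook = abortHook;
        this.timeoutMs = timeoutMs;
    }

    // Scanners hold the read lock while they work (assumed from the description).
    void beginScan() { lock.readLock().lock(); }
    void endScan() { lock.readLock().unlock(); }

    /** Returns true if the close ran, false if it timed out or was interrupted. */
    boolean close() {
        boolean acquired;
        try {
            // Bounded wait instead of an unconditional writeLock().lock().
            acquired = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            // Readers never drained: give up and let recovery reassign the region.
            abortHook.abort("Timed out after " + timeoutMs + " ms waiting for close lock");
            return false;
        }
        try {
            // The actual close work (flush, release resources) would go here.
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a wedged reader holding the read lock, close() in this sketch returns after the timeout and invokes the abort hook instead of blocking indefinitely. Choosing the timeout is the hard part: too short and a slow but healthy scan triggers an unnecessary abort.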



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
