[ https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell resolved HBASE-13238.
-----------------------------------------
Resolution: Won't Fix
> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
> Key: HBASE-13238
> URL: https://issues.apache.org/jira/browse/HBASE-13238
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> This is a brainstorming issue on the topic of timing out locks and aborting
> rather than waiting infinitely. Perhaps even as a rule.
> We had a minor production incident where a region was unable to close after
> trying for 24 hours. The CloseRegionHandler was waiting for a write lock on
> the ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding
> read locks. Three other threads were stuck in scanning, all blocked on the
> same DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the
> third was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with
> apparent infinite timeout from PacketReceiver#readChannelFully.
> This is similar to issues we have seen before, where a region wants to
> finish a compaction before closing for a split but cannot, because some HDFS
> issue has made the reader extremely slow if not wedged. This has led to what
> should be quick SplitTransactions causing availability problems of many
> minutes in length.
> The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning
> to upgrade, but [~lhofhansl] and I were discussing the issue in general and
> wonder whether we should time out acquisition of locks such as the
> ReentrantReadWriteLock and, when the timeout expires, abort the regionserver.
> In this case that would have caused recovery and reassignment of the region
> in question, and we would not have had a prolonged availability problem.
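
For illustration only, the idea of timing out the write lock and aborting could be sketched roughly as below. This is not HBase code: the class name LockTimeoutSketch, the method closeOrAbort, and the closeAction/abortAction callbacks are all hypothetical stand-ins, and the 200 ms timeout is arbitrary. The only real API used is java.util.concurrent (ReentrantReadWriteLock and its timed tryLock).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; names here are illustrative and do not exist in HBase.
final class LockTimeoutSketch {

    /**
     * Tries to take the write lock within the given timeout. On success runs
     * closeAction under the lock and returns true. If the timeout expires,
     * runs abortAction (a stand-in for aborting the server) and returns false.
     * If the waiting thread is interrupted, restores the interrupt flag and
     * returns false without running either action.
     */
    static boolean closeOrAbort(ReentrantReadWriteLock lock, long timeout, TimeUnit unit,
                                Runnable closeAction, Runnable abortAction) {
        boolean acquired;
        try {
            acquired = lock.writeLock().tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            // Outstanding read locks never drained; give up instead of waiting forever.
            abortAction.run();
            return false;
        }
        try {
            closeAction.run();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readLockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Simulates a scanner wedged on I/O while holding the read lock.
        Thread scanner = new Thread(() -> {
            lock.readLock().lock();
            try {
                readLockHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        scanner.start();
        readLockHeld.await();

        AtomicBoolean aborted = new AtomicBoolean(false);
        boolean closed = closeOrAbort(lock, 200, TimeUnit.MILLISECONDS,
            () -> System.out.println("region closed"),
            () -> aborted.set(true));
        System.out.println("closed=" + closed + " aborted=" + aborted.get());
        // prints "closed=false aborted=true"

        release.countDown();
        scanner.join();
    }
}
```

The sketch only covers the close path. It does nothing to unblock the reader threads themselves, which in the incident above were stuck inside the DFS client.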
--
This message was sent by Atlassian Jira
(v8.20.7#820007)