[ https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-13238:
-----------------------------------
Description:
This is a brainstorming issue on the topic of timing out locks and aborting
rather than waiting indefinitely. Perhaps even as a rule.
We had a minor production incident where a region was unable to close after
trying for 24 hours. The CloseRegionHandler was waiting for a write lock on the
ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding read
locks. Three other threads were stuck in scanning, all blocked on the same
DFSInputStream. Two were blocked in DFSInputStream#getFileLength, and the third
was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with an
apparently infinite timeout from PacketReceiver#readChannelFully.
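To make the failure mode concrete, here is a minimal, self-contained sketch
(illustrative only, not HBase code) of the pattern above: a writer parked on
ReentrantReadWriteLock#writeLock while a reader holds the read lock and never
returns, as the wedged scanners did. Run as-is it hangs at the final lock()
call, by design:
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StuckCloseDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // A "scanner" thread takes the read lock and then wedges, standing in
        // for a thread stuck inside a DFSInputStream read that never returns.
        Thread scanner = new Thread(() -> {
            lock.readLock().lock();
            try {
                Thread.sleep(Long.MAX_VALUE); // simulated wedged HDFS read
            } catch (InterruptedException ignored) {
            } finally {
                lock.readLock().unlock();
            }
        });
        scanner.setDaemon(true);
        scanner.start();
        Thread.sleep(100); // give the scanner time to take its read lock

        // The close path then parks here with no bound on the wait, which is
        // effectively where CloseRegionHandler sat for 24 hours.
        System.out.println("close: waiting for write lock...");
        lock.writeLock().lock(); // never returns while the read lock is held
    }
}
{code}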
This is similar to other issues we have seen before, in the context of a region
that wants to finish a compaction before closing for a split but can't, due to
some HDFS issue causing the reader to become extremely slow if not wedged. This
has led to what should be quick SplitTransactions causing availability problems
lasting many minutes.
The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning to
upgrade, but [~lhofhansl] and I were discussing the issue in general and wonder
whether we shouldn't be timing out locks such as this ReentrantReadWriteLock
and, on timeout, aborting the regionserver. In this case that would have
triggered recovery and reassignment of the region in question, and we would not
have had a prolonged availability problem.
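Concretely, the kind of change we are wondering about might look like the
sketch below. The timeout value, the field name, and the abort() hook are
illustrative assumptions here, not existing HBase code or configuration:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TimedCloseSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Hypothetical knob; a real change would read this from configuration.
    private final long closeLockTimeoutMs = 600_000L;

    void doClose() throws InterruptedException {
        // Bound the wait instead of calling lock.writeLock().lock(), and
        // treat a timeout as evidence that readers are wedged (e.g. on HDFS).
        if (!lock.writeLock().tryLock(closeLockTimeoutMs, TimeUnit.MILLISECONDS)) {
            // Aborting lets the master recover and reassign the region rather
            // than leaving it unavailable indefinitely.
            abort("Could not acquire close lock within " + closeLockTimeoutMs + " ms");
            return;
        }
        try {
            // ... proceed with the close as today ...
        } finally {
            lock.writeLock().unlock();
        }
    }

    private void abort(String reason) {
        // Stand-in for what would be HRegionServer#abort in a real patch.
        throw new RuntimeException("Aborting regionserver: " + reason);
    }
}
{code}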
was:
This is a brainstorming issue on the topic of timing out locks and aborting
rather than waiting indefinitely. Perhaps even as a rule.
We had a minor production incident where a region was unable to close after
trying for 24 hours. The CloseRegionHandler was waiting for a write lock on the
ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding read
locks. Three other threads were stuck in scanning, all blocked on the same
DFSInputStream. Two were blocked in DFSInputStream#getFileLength, and the third
was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with an
apparently infinite timeout from PacketReceiver#readChannelFully.
This is similar to other issues we have seen before, in the context of a region
that wants to finish a compaction but can't, due to some HDFS issue causing the
reader to become extremely slow if not wedged.
The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning to
upgrade, but [~lhofhansl] and I were discussing the issue in general and wonder
whether we shouldn't be timing out locks such as this ReentrantReadWriteLock
and, on timeout, aborting the regionserver. In this case that would have
triggered recovery and reassignment of the region in question, and we would not
have had a prolonged availability problem.
> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
> Key: HBASE-13238
> URL: https://issues.apache.org/jira/browse/HBASE-13238
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Purtell
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)