[ https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361479#comment-14361479 ]

zhangduo edited comment on HBASE-13238 at 3/14/15 1:44 AM:
-----------------------------------------------------------

{quote}
Even if we get hung up at the HDFS level, it's our problem, we can't just have 
indefinite unavailability in some known circumstances.
{quote}
Agree.
I know it is sometimes hard to get HDFS to support us (HBASE-5940, almost 3 years 
with no progress...); they have their own plans.
But I suggest keeping the workaround code only in critical places and clearly 
documenting why we do it.



> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
>                 Key: HBASE-13238
>                 URL: https://issues.apache.org/jira/browse/HBASE-13238
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Purtell
>
> This is a brainstorming issue on the topic of timing out locks and aborting 
> rather than waiting infinitely. Perhaps even as a rule.
> We had a minor production incident where a region was unable to close after 
> trying for 24 hours. The CloseRegionHandler was waiting for a write lock on 
> the ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding 
> read locks. Three other threads were stuck in scanning, all blocked on the 
> same DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the 
> third was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with 
> apparent infinite timeout from PacketReceiver#readChannelFully.
> This is similar to other issues we have seen before, in the context of a 
> region wanting to finish a compaction before closing for a split but being 
> unable to because some HDFS issue caused the reader to become extremely slow, 
> if not wedged outright. This has led to what should be quick SplitTransactions 
> causing availability problems many minutes in length.
> The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning 
> to upgrade, but [~lhofhansl] and I were discussing the issue in general and 
> wonder whether we should be timing out locks such as the 
> ReentrantReadWriteLock and, if so, aborting the regionserver. In this case 
> that would have triggered recovery and reassignment of the region in question 
> and we would not have had a prolonged availability problem. 
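To make the proposal above concrete, here is a minimal sketch of the timed-lock-then-abort idea. It is not HBase's actual implementation: the names closeLock, CLOSE_LOCK_TIMEOUT_MS, and abortRegionServer are hypothetical placeholders, and the real close path in HRegion#doClose is more involved. It only illustrates replacing a blocking writeLock().lock() with tryLock plus a timeout, and aborting when the lock cannot be obtained.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only; field and method names are hypothetical,
// not real HBase identifiers.
public class TimedCloseExample {

  private final ReentrantReadWriteLock closeLock = new ReentrantReadWriteLock();
  private static final long CLOSE_LOCK_TIMEOUT_MS = 60_000;

  void doClose() throws InterruptedException {
    // tryLock with a timeout instead of lock(): if readers are wedged on a
    // stuck DFSInputStream, we give up after the timeout rather than waiting
    // indefinitely for the write lock.
    if (!closeLock.writeLock().tryLock(CLOSE_LOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
      // Could not get the write lock in time; abort so the master can recover
      // and reassign the region instead of leaving it unavailable.
      abortRegionServer("Failed to acquire close lock within "
          + CLOSE_LOCK_TIMEOUT_MS + " ms; HDFS may be wedged");
      return;
    }
    try {
      // ... actual region close work would go here ...
    } finally {
      closeLock.writeLock().unlock();
    }
  }

  private void abortRegionServer(String reason) {
    // Placeholder standing in for something like HRegionServer#abort(String).
    throw new RuntimeException("ABORTING region server: " + reason);
  }
}
{code}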


