[
https://issues.apache.org/jira/browse/HBASE-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361479#comment-14361479
]
zhangduo edited comment on HBASE-13238 at 3/14/15 1:44 AM:
-----------------------------------------------------------
{quote}
Even if we get hung up at the HDFS level, it's our problem, we can't just have
indefinite unavailability in some known circumstances.
{quote}
Agree.
I know it is sometimes hard to get HDFS to support us (HBASE-5940, almost 3
years and no progress...); they have their own plans.
But I suggest keeping the workaround code only in critical places and clearly
documenting why we do it.
> Time out locks and abort if HDFS is wedged
> ------------------------------------------
>
> Key: HBASE-13238
> URL: https://issues.apache.org/jira/browse/HBASE-13238
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Purtell
>
> This is a brainstorming issue on the topic of timing out locks and aborting
> rather than waiting indefinitely. Perhaps even as a rule.
> We had a minor production incident where a region was unable to close after
> trying for 24 hours. The CloseRegionHandler was waiting for a write lock on
> the ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding
> read locks. Three other threads were stuck in scanning, all blocked on the
> same DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the
> third was waiting in epoll from SocketIOWithTimeout$SelectorPool#select with
> an apparently infinite timeout, called from PacketReceiver#readChannelFully.
> This is similar to other issues we have seen before, in the context of a
> region wanting to finish a compaction before closing for a split but unable
> to do so because some HDFS issue causes the reader to become extremely slow,
> if not wedged outright. This has led to what should be quick
> SplitTransactions causing availability problems many minutes in length.
> The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning
> to upgrade, but [~lhofhansl] and I were discussing the issue in general and
> wonder whether we should not be timing out locks such as the
> ReentrantReadWriteLock and, if a timeout fires, aborting the regionserver. In
> this case that would have triggered recovery and reassignment of the region
> in question, and we would not have had a prolonged availability problem.
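> A minimal sketch of what timed lock acquisition could look like; the timeout
> constant and the abort() helper below are illustrative only, not actual
> HRegion or HRegionServer code:
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> public class TimedCloseLockSketch {
>   private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>   // Hypothetical bound; in practice this would come from configuration.
>   private static final long CLOSE_LOCK_TIMEOUT_SEC = 60;
>
>   void doClose() throws InterruptedException {
>     // Instead of lock.writeLock().lock(), which can block forever while
>     // readers are wedged on HDFS, bound the wait and give up on timeout.
>     if (!lock.writeLock().tryLock(CLOSE_LOCK_TIMEOUT_SEC, TimeUnit.SECONDS)) {
>       // Readers are stuck: abort the regionserver so the master can
>       // recover and reassign the region instead of waiting indefinitely.
>       abort("Timed out acquiring close lock; HDFS may be wedged");
>       return;
>     }
>     try {
>       // ... proceed with the region close ...
>     } finally {
>       lock.writeLock().unlock();
>     }
>   }
>
>   // Stand-in for the regionserver abort path.
>   private void abort(String reason) {
>     throw new RuntimeException("Aborting: " + reason);
>   }
> }
> {code}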
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)