[
https://issues.apache.org/jira/browse/HBASE-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519016#comment-14519016
]
Vikas Vishwakarma commented on HBASE-13592:
-------------------------------------------
I tried a patch with the changes suggested by [~lhofhansl], in which wal.sync
is moved inside the try/catch block of the cache flush. With this change, all
the affected RegionServers shut down successfully without any RegionServer
going into a hung state. The RegionServers that do not shut down are fully
operational and working fine.
Current implementation in HRegion.java:
{noformat}
protected FlushResult internalFlushcache(
...
try {
..
this.updatesLock.writeLock().lock();
..
// <--- this will do DrainBarrier beginOp
if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
..
} finally {
this.updatesLock.writeLock().unlock();
}
...
if (wal != null && !shouldSyncLog()) {
// <--- currently outside the try/catch block for the cache flush below;
// the submitted patch moves it inside that block
wal.sync();
}
mvcc.waitForRead(w);
...
try {
...
flush cache code
...
} catch (Throwable t) {
if (wal != null) {
// <--- this will do DrainBarrier.endOp()
wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
}
......
// If we get to here, the HStores have been written.
if (wal != null) {
// <--- this will do DrainBarrier.endOp()
wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
{noformat}
The submitted patch contains the following changes:
{noformat}
protected FlushResult internalFlushcache(
...
try {
..
this.updatesLock.writeLock().lock();
..
// <--- this will do DrainBarrier beginOp
if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
..
} finally {
this.updatesLock.writeLock().unlock();
}
...
try {
if (wal != null && !shouldSyncLog()) {
// <--- now inside the flush cache try/catch block; any exception here
// will also call abortCacheFlush in the catch block, which will
// decrement the op count in DrainBarrier
wal.sync();
}
mvcc.waitForRead(w);
...
flush cache code
...
} catch (Throwable t) {
if (wal != null) {
// <--- this will do DrainBarrier.endOp()
wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
}
......
// If we get to here, the HStores have been written.
if (wal != null) {
// <--- this will do DrainBarrier.endOp()
wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
{noformat}
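To illustrate why the position of the sync call matters, here is a minimal, self-contained sketch of the begin/end accounting described above. It is not HBase code: OpCounter is a hypothetical stand-in for the op counting attributed to DrainBarrier, sync(boolean) is a hypothetical stand-in for wal.sync(), and rethrowing from the catch block is an assumption made only so the failure stays visible to the caller.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, NOT the HBase implementation.
class FlushBarrierSketch {

  // Stand-in for the begin/end op counting attributed to DrainBarrier.
  static final class OpCounter {
    private final AtomicInteger inFlight = new AtomicInteger();
    void beginOp() { inFlight.incrementAndGet(); }
    void endOp() { inFlight.decrementAndGet(); }
    int inFlight() { return inFlight.get(); }
  }

  // Stand-in for wal.sync(); fails on demand.
  static void sync(boolean failSync) throws IOException {
    if (failSync) {
      throw new IOException("simulated WAL sync failure");
    }
  }

  // Current ordering: sync runs before the try block.
  static void flushSyncOutsideTry(OpCounter ops, boolean failSync) throws IOException {
    ops.beginOp();            // corresponds to startCacheFlush
    sync(failSync);           // an exception here skips every endOp below
    try {
      // flush cache code would run here
    } catch (Throwable t) {
      ops.endOp();            // corresponds to abortCacheFlush
      throw t;                // assumption: rethrow so the caller sees the failure
    }
    ops.endOp();              // corresponds to completeCacheFlush
  }

  // Patched ordering: sync runs inside the try block.
  static void flushSyncInsideTry(OpCounter ops, boolean failSync) throws IOException {
    ops.beginOp();            // corresponds to startCacheFlush
    try {
      sync(failSync);         // an exception here now reaches the catch block
      // flush cache code would run here
    } catch (Throwable t) {
      ops.endOp();            // corresponds to abortCacheFlush
      throw t;                // assumption: rethrow so the caller sees the failure
    }
    ops.endOp();              // corresponds to completeCacheFlush
  }

  // Runs one flush attempt and reports how many ops remain unbalanced.
  static int inFlightAfter(boolean patched, boolean failSync) {
    OpCounter ops = new OpCounter();
    try {
      if (patched) {
        flushSyncInsideTry(ops, failSync);
      } else {
        flushSyncOutsideTry(ops, failSync);
      }
    } catch (IOException e) {
      // expected when failSync is true
    }
    return ops.inFlight();
  }

  public static void main(String[] args) {
    System.out.println("current ordering, sync fails: " + inFlightAfter(false, true)); // prints 1
    System.out.println("patched ordering, sync fails: " + inFlightAfter(true, true));  // prints 0
  }
}
```

In this sketch a failed sync under the current ordering leaves one op outstanding per flush attempt. A drain that waits for the count to reach zero would then never return, which matches the hang seen at shutdown.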
> RegionServer sometimes gets stuck during shutdown in case of cache flush
> failures
> ---------------------------------------------------------------------------------
>
> Key: HBASE-13592
> URL: https://issues.apache.org/jira/browse/HBASE-13592
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.10
> Reporter: Vikas Vishwakarma
> Assignee: Vikas Vishwakarma
>
> Observed that a RegionServer sometimes gets stuck during shutdown in case of
> cache flush failures. After adding a few debug logs and looking through the
> stack trace, the RegionServer process appears to be stuck in closeWAL ->
> hlog.close -> closeBarrier.stopAndDrainOps() during the shutdown sequence in
> the run method.
> From the RegionServer logs we see multiple attempts to flush the cache for a
> particular region, each of which increments the beginOp count in DrainBarrier,
> but all the flush attempts fail somewhere in wal sync and the DrainBarrier
> endOp count decrement never happens. Later, when shutdown is initiated, the
> RegionServer process is permanently stuck here.
> In this case hbase stop also does not work, and the RegionServer process has
> to be killed explicitly using kill -9.