[ 
https://issues.apache.org/jira/browse/HBASE-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520081#comment-14520081
 ] 

Lars Hofhansl commented on HBASE-13592:
---------------------------------------

I know I suggested the fix. Now looking at the code I see that we only sync the 
WAL when the CF is set up with SKIP_WAL or ASYNC_WAL, so we shouldn't have done 
the syncing (see HRegion.internalFlushCache). Maybe it's rather the wait for 
the previous MVCC transactions that times out?

In either case, moving the try up this way will cover any failures and release 
the flush barrier, so we should do this. Also [~vik.karma] verified 
experimentally that this solves the issues we've been seeing with stuck region 
servers.
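
To illustrate why the position of the try matters, here is a simplified 
sketch. It is not the HRegion code or the attached patch: SimpleBarrier, 
failingStep, leakyFlush and guardedFlush are hypothetical stand-ins, and the 
failure is simulated. It only shows that a step failing between beginOp and 
the try leaves the op count above zero, while the same failure inside the try 
is always matched by endOp.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class FlushBarrierSketch {

    /** Simplified stand-in for a drain barrier; NOT HBase's DrainBarrier. */
    static final class SimpleBarrier {
        private final AtomicInteger inFlight = new AtomicInteger();

        void beginOp() { inFlight.incrementAndGet(); }
        void endOp() { inFlight.decrementAndGet(); }
        int pending() { return inFlight.get(); }
    }

    /** Hypothetical flush step that always fails (e.g. a sync or wait timing out). */
    static void failingStep() throws IOException {
        throw new IOException("simulated flush failure");
    }

    /** The failing step runs before the try, so the finally block never executes. */
    static int leakyFlush(SimpleBarrier barrier) {
        try {
            barrier.beginOp();
            failingStep();          // throws outside the guarded region
            try {
                // remainder of the flush would go here
            } finally {
                barrier.endOp();    // skipped when failingStep() throws
            }
        } catch (IOException e) {
            // flush failed; the caller may retry later
        }
        return barrier.pending();
    }

    /** The try is moved up, so every failure after beginOp still reaches endOp. */
    static int guardedFlush(SimpleBarrier barrier) {
        try {
            barrier.beginOp();
            try {
                failingStep();      // throws inside the guarded region
            } finally {
                barrier.endOp();    // always runs
            }
        } catch (IOException e) {
            // flush failed; the barrier is already released
        }
        return barrier.pending();
    }

    public static void main(String[] args) {
        System.out.println("leaky pending ops:   " + leakyFlush(new SimpleBarrier()));
        System.out.println("guarded pending ops: " + guardedFlush(new SimpleBarrier()));
    }
}
```

With the leaky variant each failed flush leaves one op outstanding, so anything 
that waits for the count to drain to zero would wait forever. The guarded 
variant leaves the count at zero.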

Will commit in a bit. Thanks for doing all the detective work, Vikas!

> RegionServer sometimes gets stuck during shutdown in case of cache flush 
> failures
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13592
>                 URL: https://issues.apache.org/jira/browse/HBASE-13592
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.10
>            Reporter: Vikas Vishwakarma
>            Assignee: Vikas Vishwakarma
>             Fix For: 0.98.13
>
>         Attachments: HBASE-13592-0.98.patch
>
>
> Observed that RegionServer sometimes gets stuck during shutdown in case of 
> cache flush failures. After adding a few debug logs and looking through the 
> stack trace, the RegionServer process looks stuck in closeWAL -> hlog.close -> 
> closeBarrier.stopAndDrainOps(); during the shutdown sequence in the run method.
> From the RegionServer logs we see multiple attempts to flush the cache for a 
> particular region, each of which increments the beginOp count in DrainBarrier, 
> but all the flush attempts fail somewhere in WAL sync and the DrainBarrier 
> endOp count decrement never happens. Later on, when shutdown is initiated, 
> the RegionServer process is permanently stuck here.
> In this case hbase stop also does not work and the RegionServer process has 
> to be explicitly killed using kill -9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
