[
https://issues.apache.org/jira/browse/HBASE-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506464#comment-14506464
]
Vikas Vishwakarma commented on HBASE-13418:
-------------------------------------------
[~esteban], also, in the disruptive case the issue is that the regions remain
stuck forever, even after all the DataNodes are back up and the HDFS layer
has recovered completely. I am testing the DFS timeout fix provided by
[~apurtell], which is a clone of HDFS-7005, against the reproducible case first,
and will then check whether it also fixes other similar scenarios.
> Regions getting stuck in PENDING_CLOSE state infinitely in high load HA
> scenarios
> ---------------------------------------------------------------------------------
>
> Key: HBASE-13418
> URL: https://issues.apache.org/jira/browse/HBASE-13418
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.10
> Reporter: Vikas Vishwakarma
>
> In some heavy data load cases, when multiple RegionServers are going
> up/down (HA) or when we try to shut down/restart the entire HBase cluster, we
> observe that some regions get stuck in PENDING_CLOSE state indefinitely.
> Going through the logs for one region stuck in PENDING_CLOSE state, it looks
> like two memstore flushes were triggered for this region within a few
> milliseconds of each other, as shown below, and after some time there is an
> "Unrecoverable exception while closing region" error. I suspect this could
> be some kind of race condition, but this needs further checking.
> Logs:
> ================
> ......
> 2015-04-06 11:47:33,309 INFO [2,queue=0,port=60020]
> regionserver.HRegionServer - Close 884fd5819112370d9a9834895b0ec19c, via
> zk=yes, znode version=0, on
> blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
> 2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0]
> handler.CloseRegionHandler - Processing close of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Closing
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
> disabling compactions & flushes
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Running close preflush of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Updates disabled for region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,328 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't
> find oldest seqNum for the region we are about to flush:
> [884fd5819112370d9a9834895b0ec19c]
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] regionserver.MemStore -
> Snapshot called again without clearing previous. Doing nothing. Another
> ongoing flush or did we fail last attempt?
> 2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0]
> regionserver.HRegionServer - ABORTING region server
> blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable
> exception while closing region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> still finishing close
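
To check whether other stuck regions show the same pattern, the RegionServer logs can be scanned for two "Started memstore flush" entries for the same region within a few milliseconds of each other. The following is only a rough sketch and not part of HBase: it assumes each log entry sits on a single unwrapped line in the layout shown above, and the function name, regex and 100 ms window are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed layout of one unwrapped log line:
#   "<date> <time>,<millis> <LEVEL> [...] ... Started memstore flush for
#    <table>,<start key>,<timestamp>.<32-hex encoded region name>., current ..."
FLUSH_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Started memstore flush for \S+\.([0-9a-f]{32})\.,"
)


def find_repeated_flush_starts(lines, window=timedelta(milliseconds=100)):
    """Return (encoded_region, first_ts, second_ts) for every pair of
    consecutive flush starts on the same region that are at most
    `window` apart. The default window is an arbitrary illustration."""
    last_seen = {}  # encoded region name -> timestamp of previous flush start
    hits = []
    for line in lines:
        match = FLUSH_START.match(line)
        if match is None:
            continue
        # %f accepts the 3-digit millisecond field and pads it on the right.
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        region = match.group(2)
        prev = last_seen.get(region)
        if prev is not None and ts - prev <= window:
            hits.append((region, prev, ts))
        last_seen[region] = ts
    return hits
```

Passing an open log file to find_repeated_flush_starts() yields one tuple per suspicious pair. Entries that the logger or a mail client has wrapped across several lines would have to be rejoined first.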