[
https://issues.apache.org/jira/browse/HBASE-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507195#comment-14507195
]
Esteban Gutierrez commented on HBASE-13418:
-------------------------------------------
That is very interesting. I think I saw something like this a long time ago
when using Hadoop 2.0 with short-circuit reads (SCRs): we need to read back the
snapshot to verify it was written to HDFS, and that read hangs because the
underlying DataNode (DN) is gone. I haven't seen it in newer versions of
Hadoop; perhaps it is related?
> Regions getting stuck in PENDING_CLOSE state infinitely in high load HA
> scenarios
> ---------------------------------------------------------------------------------
>
> Key: HBASE-13418
> URL: https://issues.apache.org/jira/browse/HBASE-13418
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.10
> Reporter: Vikas Vishwakarma
>
> Under heavy data load, when multiple RegionServers go up and down (HA) or
> when we shut down or restart the entire HBase cluster, we observe that some
> regions get stuck in the PENDING_CLOSE state indefinitely.
> Going through the logs for one region stuck in the PENDING_CLOSE state, it
> looks like two memstore flushes were triggered for this region within a few
> milliseconds of each other, as shown below, and some time later there is an
> "Unrecoverable exception while closing region". I suspect this is some kind
> of race condition, but I need to check further.
> Logs:
> ================
> ......
> 2015-04-06 11:47:33,309 INFO [2,queue=0,port=60020]
> regionserver.HRegionServer - Close 884fd5819112370d9a9834895b0ec19c, via
> zk=yes, znode version=0, on
> blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
> 2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0]
> handler.CloseRegionHandler - Processing close of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Closing
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
> disabling compactions & flushes
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Running close preflush of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Updates disabled for region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,328 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't
> find oldest seqNum for the region we are about to flush:
> [884fd5819112370d9a9834895b0ec19c]
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] regionserver.MemStore -
> Snapshot called again without clearing previous. Doing nothing. Another
> ongoing flush or did we fail last attempt?
> 2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0]
> regionserver.HRegionServer - ABORTING region server
> blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable
> exception while closing region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> still finishing close
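
To check whether other regions show the same pattern, a RegionServer log could be scanned for two "Started memstore flush" entries on the same region within a few milliseconds. The following is only a rough sketch, not part of HBase: the function name find_repeated_flushes and the 50 ms default window are made up for illustration, and it assumes each log entry sits on a single unwrapped line in the layout shown in the logs above (unlike the wrapped copy in this message).

```python
import re
from datetime import datetime, timedelta

# Assumed layout (one entry per physical line):
#   2015-04-06 11:47:33,319 INFO [thread] regionserver.HRegion - Started
#   memstore flush for <table>,<start key>,<ts>.<encoded name>., current ...
# The encoded region name is taken to be the 32 hex characters between dots.
FLUSH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Started memstore flush for .*\.(?P<region>[0-9a-f]{32})\."
)


def find_repeated_flushes(lines, window_ms=50):
    """Return (region, first_ts, second_ts) for flush starts on the same
    region that are logged no more than window_ms apart."""
    window = timedelta(milliseconds=window_ms)
    last_seen = {}  # encoded region name -> timestamp of previous flush start
    hits = []
    for line in lines:
        m = FLUSH_RE.match(line)
        if m is None:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        region = m.group("region")
        prev = last_seen.get(region)
        if prev is not None and ts - prev <= window:
            hits.append((region, prev, ts))
        last_seen[region] = ts
    return hits
```

Passing an open log file (for example `find_repeated_flushes(open(path))`, with `path` pointing at a RegionServer log) yields one tuple per suspicious pair. A hit only shows that two flush starts were logged close together; it does not by itself confirm the suspected race.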