Vikas Vishwakarma created HBASE-13418:
-----------------------------------------
Summary: Regions getting stuck in PENDING_CLOSE state infinitely
Key: HBASE-13418
URL: https://issues.apache.org/jira/browse/HBASE-13418
Project: HBase
Issue Type: Bug
Affects Versions: 0.98.10
Reporter: Vikas Vishwakarma
Under heavy data load, when multiple RegionServers are going up and down (HA)
or when we shut down or restart the entire HBase cluster, we observe that some
regions get stuck in the PENDING_CLOSE state indefinitely.
Going through the logs for one region stuck in PENDING_CLOSE, it looks like two
memstore flushes were triggered for this region within a few milliseconds of
each other, as shown below, followed shortly by an "Unrecoverable exception
while closing region" that aborts the region server. I suspect this is some
kind of race condition, but I need to check further.
Logs:
================
......
2015-04-06 11:47:33,309 INFO [2,queue=0,port=60020] regionserver.HRegionServer
- Close 884fd5819112370d9a9834895b0ec19c, via zk=yes, znode version=0, on
blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0] handler.CloseRegionHandler
- Processing close of
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
Closing
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
disabling compactions & flushes
2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
Running close preflush of
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
Started memstore flush for
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
current region memstore size 70.0 M
2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
Updates disabled for region
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
2015-04-06 11:47:33,328 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
Started memstore flush for
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
current region memstore size 70.0 M
2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't find
oldest seqNum for the region we are about to flush:
[884fd5819112370d9a9834895b0ec19c]
2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] regionserver.MemStore -
Snapshot called again without clearing previous. Doing nothing. Another ongoing
flush or did we fail last attempt?
2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0] regionserver.HRegionServer
- ABORTING region server
blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable
exception while closing region
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
still finishing close
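To check whether other regions show the same double-flush pattern, something like the rough sketch below could be run over a RegionServer log. It is not part of HBase; the function names and the 50 ms window are assumptions chosen for illustration. It joins wrapped continuation lines (as in the excerpt above) onto their timestamped line, then reports any region whose "Started memstore flush" message appears twice within the window.

```python
import re
from datetime import datetime, timedelta

# A record starts with a timestamp such as "2015-04-06 11:47:33,319 ".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")
# Capture the 32-character encoded region name at the end of the region string.
FLUSH = re.compile(r"Started memstore flush for\s+\S*?\.([0-9a-f]{32})\.")


def records(lines):
    """Join wrapped continuation lines onto the timestamped line that starts them."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if TS.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            current += " " + line.strip()
    if current is not None:
        yield current


def repeated_flushes(lines, window=timedelta(milliseconds=50)):
    """Return (region, first_ts, second_ts) for flush starts no further apart than window."""
    last_seen = {}
    hits = []
    for rec in records(lines):
        m = FLUSH.search(rec)
        if not m:
            continue
        ts = datetime.strptime(TS.match(rec).group(1), "%Y-%m-%d %H:%M:%S,%f")
        region = m.group(1)
        prev = last_seen.get(region)
        if prev is not None and ts - prev <= window:
            hits.append((region, prev, ts))
        last_seen[region] = ts
    return hits
```

Usage would be along the lines of `repeated_flushes(open(path))`, where `path` points at a RegionServer log file. A hit only indicates that two flush starts were logged close together for the same region; it does not by itself confirm the suspected race.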