[
https://issues.apache.org/jira/browse/HBASE-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484660#comment-14484660
]
Andrew Purtell edited comment on HBASE-13418 at 4/8/15 3:43 AM:
----------------------------------------------------------------
Isn't 'w' in local scope? So two threads both executing internalFlushcache will
have distinct private values for 'w' in frames on their own stacks. I don't
think the issue is exactly what you describe, but that's not to say there isn't
a locking protocol problem here somewhere. (FWIW, the 'mvcc' object could be
shared between multiple threads running on the same region, and we do
synchronize access to the mvcc's pending queue of writes in
beginMemstoreInsert.) Would it be possible to get a complete stack dump from a
regionserver that has a region stuck in PENDING_CLOSE? Use jstack or the dump
servlet http://<rs>:<http-port>/dump
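To make the two points above concrete, here is a minimal, self-contained Java
sketch. The class and method names (LocalScopeSketch, Mvcc, flushLike) are
hypothetical stand-ins, not the real HBase MultiVersionConsistencyControl or
HRegion code: a local like 'w' is private to each thread's stack frame, while
the shared pending-writes queue needs the lock.
{code:java}
import java.util.LinkedList;

// Sketch only; simplified stand-ins, not the real HBase
// MultiVersionConsistencyControl / WriteEntry classes.
public class LocalScopeSketch {

    static class WriteEntry {
        final long writeNumber;
        WriteEntry(long writeNumber) { this.writeNumber = writeNumber; }
    }

    static class Mvcc {
        private final LinkedList<WriteEntry> writeQueue = new LinkedList<>();
        private long memstoreWrite = 0;

        // Mirrors the synchronized access to the pending queue of writes
        // in beginMemstoreInsert.
        WriteEntry beginMemstoreInsert() {
            synchronized (writeQueue) {
                WriteEntry e = new WriteEntry(++memstoreWrite);
                writeQueue.add(e);
                return e;
            }
        }
    }

    static final Mvcc mvcc = new Mvcc(); // shared across threads, like a region's mvcc

    // Analogue of internalFlushcache: 'w' is a local variable, so each
    // thread executing this method gets its own copy in its own stack frame.
    static void flushLike(String name) {
        WriteEntry w = mvcc.beginMemstoreInsert();
        System.out.println(name + " got writeNumber " + w.writeNumber);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> flushLike("flusher-1"));
        Thread t2 = new Thread(() -> flushLike("flusher-2"));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The two threads always print distinct writeNumbers: the shared
        // queue is lock-protected, and each 'w' is private to its frame.
    }
}
{code}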
> Regions getting stuck in PENDING_CLOSE state infinitely in high load HA
> scenarios
> ---------------------------------------------------------------------------------
>
> Key: HBASE-13418
> URL: https://issues.apache.org/jira/browse/HBASE-13418
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.10
> Reporter: Vikas Vishwakarma
>
> In some heavy data load cases, when multiple RegionServers are going
> up/down (HA) or when we try to shut down/restart the entire HBase cluster, we
> observe that some regions get stuck in PENDING_CLOSE state indefinitely.
> Going through the logs for a particular region stuck in PENDING_CLOSE
> state, it looks like two memstore flushes were triggered for this region
> within a few milliseconds of each other, as given below, and after some time
> there is an "Unrecoverable exception while closing region". I suspect this
> could be some kind of race condition but need to check further (see the
> sketch after the logs).
> Logs:
> ================
> ......
> 2015-04-06 11:47:33,309 INFO [2,queue=0,port=60020]
> regionserver.HRegionServer - Close 884fd5819112370d9a9834895b0ec19c, via
> zk=yes, znode version=0, on
> blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
> 2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0]
> handler.CloseRegionHandler - Processing close of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Closing
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
> disabling compactions & flushes
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Running close preflush of
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Updates disabled for region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,328 INFO [-dnds3-4-crd:60020-0] regionserver.HRegion -
> Started memstore flush for
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> current region memstore size 70.0 M
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't
> find oldest seqNum for the region we are about to flush:
> [884fd5819112370d9a9834895b0ec19c]
> 2015-04-06 11:47:33,328 WARN [-dnds3-4-crd:60020-0] regionserver.MemStore -
> Snapshot called again without clearing previous. Doing nothing. Another
> ongoing flush or did we fail last attempt?
> 2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0]
> regionserver.HRegionServer - ABORTING region server
> blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable
> exception while closing region
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
> still finishing close
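The "Snapshot called again without clearing previous. Doing nothing." line
above comes from a guard in MemStore.snapshot() that refuses to replace a
snapshot that was never cleared. Below is a minimal sketch of that guard
pattern; the field and method names follow MemStore loosely and the types are
simplified stand-ins, not the actual HBase code.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the snapshot guard behind the WARN in the log above;
// illustrative only, not the real org.apache.hadoop.hbase.regionserver.MemStore.
public class MemStoreSketch {
    private List<String> kvset = new ArrayList<>();    // active edits
    private List<String> snapshot = new ArrayList<>(); // edits being flushed

    // Called at the start of a flush: set the active set aside for writing.
    synchronized void snapshot() {
        if (!snapshot.isEmpty()) {
            // A previous snapshot was never cleared: another flush is in
            // flight, or the last attempt failed. Refusing to overwrite it
            // avoids losing the edits it still holds.
            System.out.println("Snapshot called again without clearing"
                + " previous. Doing nothing.");
            return;
        }
        snapshot = kvset;
        kvset = new ArrayList<>();
    }

    // Called once the snapshot has been persisted successfully.
    synchronized void clearSnapshot() {
        snapshot = new ArrayList<>();
    }
}
{code}
If two flush paths reach snapshot() for the same region within milliseconds,
the second one hits this guard exactly as the 11:47:33,328 WARN shows, which
is consistent with the race suspected in the description.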