[ https://issues.apache.org/jira/browse/HBASE-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484607#comment-14484607 ]

Vikas Vishwakarma commented on HBASE-13418:
-------------------------------------------

[~esteban] I am seeing a socket timeout immediately after this:
{noformat}

2015-04-06 11:47:33,344 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegionServer 
- STOPPED: Unrecoverable exception while closing region 
RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
 still finishing close
2015-04-06 11:47:33,344 INFO  [regionserver60020] ipc.RpcServer - Stopping 
server on 60020
2015-04-06 11:47:33,344 ERROR [-dnds3-4-crd:60020-0] executor.EventHandler - 
Caught throwable while processing event M_RS_CLOSE_REGION
java.lang.RuntimeException: java.net.SocketTimeoutException: 70000 millis 
timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.230.229.25:42201 
remote=/10.230.229.20:50010]
        at 
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:165)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.230.229.25:42201 
remote=/10.230.229.20:50010]
{noformat}

A few important log traces from the above:
{noformat}
2015-04-06 11:47:33,319 INFO ... regionserver.HRegion - Started memstore flush 
for ... 884fd5819112370d9a9834895b0ec19c ..
..
2015-04-06 11:47:33,328 INFO ... regionserver.HRegion - Started memstore flush 
for ... 884fd5819112370d9a9834895b0ec19c ..
...
2015-04-06 11:47:33,328 WARN ... regionserver.MemStore - Snapshot called again 
without clearing previous. Doing nothing. Another ongoing flush or did we fail 
last attempt?
...
2015-04-06 11:47:33,334 FATAL ... regionserver.HRegionServer - ABORTING region 
server xyz,60020,1428318082860: Unrecoverable exception while closing region 
...
2015-04-06 11:47:33,344 ERROR ... executor.EventHandler - Caught throwable 
while processing event M_RS_CLOSE_REGION
java.lang.RuntimeException: java.net.SocketTimeoutException: 70000 millis 
timeout while waiting for channel to be ready for read. ch 
{noformat}

What I am suspecting is as follows:
From the above logs it looks like two threads are somehow trying to run close
concurrently on the same region. In the HRegion class, on the region close path,
I see one possible concurrency issue in internalFlushcache during memstore
snapshotting & flush before close. Here MultiVersionConsistencyControl.WriteEntry
is initialized under a lock, but mvcc.waitForRead(w); happens after the lock is
released. In that window a second thread can re-initialize the w object
(w = mvcc.beginMemstoreInsert()) while the first thread is still using it
(mvcc.waitForRead(w)). This could lead to a case where waitForRead() waits
indefinitely, ultimately timing out.


------------ HRegion.java ---------------
{noformat}
protected FlushResult internalFlushcache(final HLog wal, final long myseqid,
    MonitoredTask status) {
  ...
  MultiVersionConsistencyControl.WriteEntry w = null;

  this.updatesLock.writeLock().lock();
  try {
    ....
    w = mvcc.beginMemstoreInsert(); // <-- long nextWriteNumber = ++memstoreWrite;
                                    //     WriteEntry e = new WriteEntry(nextWriteNumber); return e;
    ...
  } finally {
    this.updatesLock.writeLock().unlock();
  }
  .....
  mvcc.waitForRead(w); // <-- concurrency issue: if the first thread is still using the w
                       //     object, a second thread can run mvcc.beginMemstoreInsert(),
                       //     re-initializing w and leading to an indefinite wait and timeout
{noformat}
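
To make the suspected interleaving concrete, here is a toy, self-contained sketch. The class and method names are illustrative stand-ins, not the real HBase MVCC classes, and it assumes both flush paths end up sharing the same w handle:
{noformat}
public class SharedWriteEntryRace {

  static class WriteEntry {
    final long writeNumber;
    volatile boolean completed;
    WriteEntry(long writeNumber) { this.writeNumber = writeNumber; }
  }

  static volatile WriteEntry w;       // the shared handle both flushes touch (assumption)
  static long nextWriteNumber = 0;

  // Stand-in for mvcc.beginMemstoreInsert(): hands out a new entry and re-points w at it
  static synchronized WriteEntry beginMemstoreInsert() {
    w = new WriteEntry(++nextWriteNumber);
    return w;
  }

  // Stand-in for mvcc.waitForRead(e): blocks until the entry it was handed completes
  static void waitForRead(WriteEntry e) throws InterruptedException {
    while (!e.completed) {
      Thread.sleep(100);
    }
  }

  public static void main(String[] args) throws Exception {
    WriteEntry first = beginMemstoreInsert(); // flush #1 takes its entry under the lock
    beginMemstoreInsert();                    // flush #2 races in and re-initializes w
    w.completed = true;                       // only the *new* entry is ever completed
    System.out.println("flush #1 now waits on a stale entry nobody will complete...");
    waitForRead(first);                       // never returns -> indefinite wait / timeout
  }
}
{noformat}
In this toy model main() hangs in waitForRead(first), because only the re-initialized entry ever gets completed, which is the indefinite wait described above.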

So I think moving mvcc.waitForRead(w); within the locking scope of 
updatesLock.writeLock() should solve this.
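
For clarity, a rough sketch of what that change would look like (adapted from the excerpt above, not a tested patch):
{noformat}
this.updatesLock.writeLock().lock();
try {
  ....
  w = mvcc.beginMemstoreInsert();
  ...
  mvcc.waitForRead(w);  // <-- moved inside the write-lock scope so a concurrent flush
                        //     cannot begin a new memstore insert before this wait finishes
} finally {
  this.updatesLock.writeLock().unlock();
}
{noformat}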

Any thoughts on this?



> Regions getting stuck in PENDING_CLOSE state infinitely in high load HA 
> scenarios
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13418
>                 URL: https://issues.apache.org/jira/browse/HBASE-13418
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.10
>            Reporter: Vikas Vishwakarma
>
> In some heavy data load cases, when there are multiple RegionServers going
> up/down (HA) or when we try to shut down/restart the entire HBase cluster, we
> are observing that some regions get stuck in the PENDING_CLOSE state
> indefinitely.
> On going through the logs for a particular region stuck in PENDING_CLOSE
> state, it looks like two memstore flushes got triggered for this region within
> a few milliseconds of each other, as shown below, and after some time there is
> an "Unrecoverable exception while closing region". I am suspecting this could
> be some kind of race condition, but I need to check further.
> Logs:
> ================
> ......
> 2015-04-06 11:47:33,309 INFO  [2,queue=0,port=60020] 
> regionserver.HRegionServer - Close 884fd5819112370d9a9834895b0ec19c, via 
> zk=yes, znode version=0, on 
> blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
> 2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0] 
> handler.CloseRegionHandler - Processing close of 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion - 
> Closing 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
>  disabling compactions & flushes
> 2015-04-06 11:47:33,319 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - 
> Running close preflush of 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - 
> Started memstore flush for 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
>  current region memstore size 70.0 M
> 2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion - 
> Updates disabled for region 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,328 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - 
> Started memstore flush for 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
>  current region memstore size 70.0 M
> 2015-04-06 11:47:33,328 WARN  [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't 
> find oldest seqNum for the region we are about to flush: 
> [884fd5819112370d9a9834895b0ec19c]
> 2015-04-06 11:47:33,328 WARN  [-dnds3-4-crd:60020-0] regionserver.MemStore - 
> Snapshot called again without clearing previous. Doing nothing. Another 
> ongoing flush or did we fail last attempt?
> 2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0] 
> regionserver.HRegionServer - ABORTING region server 
> blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable 
> exception while closing region 
> RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
>  still finishing close



