[
https://issues.apache.org/jira/browse/HBASE-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518913#comment-14518913
]
Vikas Vishwakarma commented on HBASE-13592:
-------------------------------------------
Adding the relevant stack trace and log trace details below:
{noformat}
=====================
STACK TRACE:
=====================
"regionserver60020" prio=10 tid=0x00007f70d153a000 nid=0xa105 in Object.wait()
[0x00007f68cba79000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:115) <---------
- locked <0x00007f69137f0270> (a org.apache.hadoop.hbase.util.DrainBarrier)
at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:85)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.close(FSHLog.java:926)
at org.apache.hadoop.hbase.regionserver.HRegionServer.closeWAL(HRegionServer.java:1250)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1041)
at java.lang.Thread.run(Thread.java:745)
===================
LOG Trace
====================
2015-04-28 09:00:48,939 DEBUG [7,queue=2,port=60020] regionserver.HRegion - Flush requested on
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:00:48,939 INFO [MemStoreFlusher.0] regionserver.HRegion - Started memstore flush for
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.,
current region memstore size 999.2 M
2015-04-28 09:00:48,939 DEBUG [MemStoreFlusher.0] wal.FSHLog - vikas: beginOp startCacheFlush for 0f6b76ceb0efaf2d52c211021f4ebde3
2015-04-28 09:00:48,939 WARN [MemStoreFlusher.0] regionserver.MemStore - Snapshot called again without clearing previous. Doing nothing. Another ongoing flush or did we fail last attempt?
2015-04-28 09:00:48,964 ERROR [MemStoreFlusher.0] regionserver.MemStoreFlusher - Cache flush failed for region
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.230.229.13:45046 remote=/10.230.229.28:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1063)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1031)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1174)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
....
multiple attempts and failures
.....
Finally the region is closed by CloseHandler
.....
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] handler.CloseRegionHandler - Processing close of
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] regionserver.HRegion - Closing
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.:
disabling compactions & flushes
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] regionserver.HRegion - Updates disabled for region
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,039 INFO [-dnds1-2-crd:60020-0] regionserver.HRegion - Closed
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,039 DEBUG [-dnds1-2-crd:60020-0] handler.CloseRegionHandler - Closed
AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
.....
The shutdown sequence, however, hangs after the following lines in the log. In FSHLog.close() the AsyncNotifier thread is stopped first, followed by the AsyncSyncer threads and then the AsyncWriter thread, after which closeBarrier.stopAndDrainOps() is called and gets stuck (a minimal sketch of this hang follows the log excerpt).
.....
2015-04-28 10:56:12,489 INFO [regionserver60020] regionserver.HRegionServer - stopping server blitzhbase01-dnds1-2-crd.eng.sfdc.net,60020,1430210003779; all regions closed.
2015-04-28 10:56:12,489 DEBUG [20-WAL.AsyncNotifier] wal.FSHLog - regionserver60020-WAL.AsyncNotifier interrupted while waiting for notification from AsyncSyncer thread
2015-04-28 10:56:12,489 INFO [20-WAL.AsyncNotifier] wal.FSHLog - regionserver60020-WAL.AsyncNotifier exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer0] wal.FSHLog - regionserver60020-WAL.AsyncSyncer0 interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO [020-WAL.AsyncSyncer0] wal.FSHLog - regionserver60020-WAL.AsyncSyncer0 exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer1] wal.FSHLog - regionserver60020-WAL.AsyncSyncer1 interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO [020-WAL.AsyncSyncer1] wal.FSHLog - regionserver60020-WAL.AsyncSyncer1 exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer2] wal.FSHLog - regionserver60020-WAL.AsyncSyncer2 interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO [020-WAL.AsyncSyncer2] wal.FSHLog - regionserver60020-WAL.AsyncSyncer2 exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer3] wal.FSHLog - regionserver60020-WAL.AsyncSyncer3 interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO [020-WAL.AsyncSyncer3] wal.FSHLog - regionserver60020-WAL.AsyncSyncer3 exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer4] wal.FSHLog - regionserver60020-WAL.AsyncSyncer4 interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO [020-WAL.AsyncSyncer4] wal.FSHLog - regionserver60020-WAL.AsyncSyncer4 exiting
2015-04-28 10:56:12,491 DEBUG [0020-WAL.AsyncWriter] wal.FSHLog - regionserver60020-WAL.AsyncWriter interrupted while waiting for newer writes added to local buffer
2015-04-28 10:56:12,491 INFO [0020-WAL.AsyncWriter] wal.FSHLog - regionserver60020-WAL.AsyncWriter exiting
2015-04-28 10:56:12,491 INFO [regionserver60020] wal.FSHLog - closeBarrier.stopAndDrainOps <----- (this log added for testing)
2015-04-28 10:56:13,925 DEBUG [ver60020-EventThread] regionserver.SplitLogWorker - tasks arrived or departed
{noformat}
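
To make the failure mode concrete, below is a minimal, self-contained Java sketch of the suspected mechanism: beginOp() is counted when the cache flush starts, the flush fails before the matching endOp() is reached, and stopAndDrainOps() then waits forever. SimpleDrainBarrier here is a simplified stand-in written only for illustration; it is not the actual org.apache.hadoop.hbase.util.DrainBarrier source.
{noformat}
// Minimal, self-contained sketch of the suspected hang. SimpleDrainBarrier is a
// simplified stand-in written for illustration, NOT the actual
// org.apache.hadoop.hbase.util.DrainBarrier source.
public class DrainBarrierHangSketch {

  static class SimpleDrainBarrier {
    private int outstanding = 0;     // ops begun but not yet ended
    private boolean draining = false;

    // Called when an operation (e.g. a cache flush) starts.
    synchronized boolean beginOp() {
      if (draining) return false;
      outstanding++;
      return true;
    }

    // Called when the operation finishes; this is the call that never happens
    // when the flush fails during WAL sync.
    synchronized void endOp() {
      outstanding--;
      if (outstanding == 0 && draining) notifyAll();
    }

    // Mirrors stopAndDrainOps(): refuse new ops, then wait for outstanding == 0.
    // If an endOp() was lost, this waits forever.
    synchronized void stopAndDrainOps() throws InterruptedException {
      draining = true;
      while (outstanding > 0) {
        wait();   // the regionserver60020 thread is parked in an Object.wait() like this
      }
    }
  }

  public static void main(String[] args) throws Exception {
    SimpleDrainBarrier barrier = new SimpleDrainBarrier();

    // Flush attempt: beginOp() succeeds, but the flush "fails" (compare the
    // SocketTimeoutException in the log) before endOp() is ever reached.
    if (barrier.beginOp()) {
      try {
        throw new java.io.IOException("simulated WAL sync failure");
      } catch (java.io.IOException e) {
        // failure path returns without calling barrier.endOp()
      }
    }

    // Shutdown: this call never returns because outstanding is still 1.
    System.out.println("calling stopAndDrainOps(), this will hang...");
    barrier.stopAndDrainOps();
  }
}
{noformat}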
> RegionServer sometimes gets stuck during shutdown in case of cache flush
> failures
> ---------------------------------------------------------------------------------
>
> Key: HBASE-13592
> URL: https://issues.apache.org/jira/browse/HBASE-13592
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.10
> Reporter: Vikas Vishwakarma
> Assignee: Vikas Vishwakarma
>
> Observed that the RegionServer sometimes gets stuck during shutdown in case of
> cache flush failures. After adding a few debug logs and looking through the
> stack trace, the RegionServer process appears stuck in closeWAL -> hlog.close ->
> closeBarrier.stopAndDrainOps() during the shutdown sequence in the run method.
> From the RegionServer logs we see multiple attempts to flush the cache for a
> particular region, each of which increments the beginOp count in DrainBarrier,
> but all the flush attempts fail somewhere in WAL sync and the corresponding
> DrainBarrier endOp decrement never happens. Later, when shutdown is initiated,
> the RegionServer process gets permanently stuck here.
> In this case hbase stop also does not work and the RegionServer process has to
> be explicitly killed using kill -9.
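
Following on from the description above, the core issue is that a failed flush leaves the barrier's operation count incremented. Below is a minimal sketch of the kind of try/finally guard that keeps beginOp()/endOp() balanced even when the flush throws; it reuses the SimpleDrainBarrier class from the earlier sketch, and the names startCacheFlushGuarded and doFlushAndWalSync are hypothetical, not taken from the actual FSHLog/HRegion code or from any proposed patch.
{noformat}
// Hypothetical guard around the flush path, reusing SimpleDrainBarrier from the
// sketch in the comment above. All names here are illustrative only; this is not
// the actual FSHLog/HRegion code or a proposed patch.
public class GuardedFlushSketch {

  // Stand-in for the real flush + WAL sync, which failed with SocketTimeoutException.
  static void doFlushAndWalSync() throws java.io.IOException {
    throw new java.io.IOException("simulated WAL sync failure");
  }

  static boolean startCacheFlushGuarded(DrainBarrierHangSketch.SimpleDrainBarrier barrier)
      throws java.io.IOException {
    if (!barrier.beginOp()) {
      return false;            // barrier already draining, flush not started
    }
    try {
      doFlushAndWalSync();     // may throw
      return true;
    } finally {
      barrier.endOp();         // always balances beginOp(), even on failure
    }
  }

  public static void main(String[] args) throws Exception {
    DrainBarrierHangSketch.SimpleDrainBarrier barrier =
        new DrainBarrierHangSketch.SimpleDrainBarrier();
    try {
      startCacheFlushGuarded(barrier);
    } catch (java.io.IOException e) {
      // the flush failed, but the barrier count is already back to zero
    }
    barrier.stopAndDrainOps(); // now returns immediately instead of hanging
    System.out.println("drain completed");
  }
}
{noformat}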