[ https://issues.apache.org/jira/browse/HBASE-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656459#comment-16656459 ]
Guanghao Zhang commented on HBASE-21325:
----------------------------------------
I wrote a unit test for this case and found that the region server does not
hang in waitOnAllRegionsToClose, because we break out of the loop even if
there are still online regions.
{code:java}
// No regions in RIT, we could stop waiting now.
if (this.regionsInTransitionInRS.isEmpty()) {
  if (!isOnlineRegionsEmpty()) {
    LOG.info("We were exiting though online regions are not empty," +
        " because some regions failed closing");
  }
  break;
}
{code}
2018-10-19 16:26:28,449 INFO [RS:1;hao-OptiPlex-7050:37602] regionserver.HRegionServer(1426): We were exiting though online regions are not empty, because some regions failed closing
However, the region server still hangs while shutting down the WAL when it stops.
{code:java}
"RS:1;hao-OptiPlex-7050:37602" daemon prio=5 tid=380 in Object.wait()
java.lang.Thread.State: WAITING (on object monitor)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:821)
    at org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.shutdown(SyncReplicationWALProvider.java:225)
    at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:246)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1459)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1115)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:184)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:130)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:341)
    at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:165)
    at java.lang.Thread.run(Thread.java:748)
{code}
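For illustration only: the trace above is parked in an untimed ReentrantLock.lock() call, which waits for as long as another thread holds the lock. A minimal sketch of a timed acquisition is shown below. The names TimedLockSketch and runWithLock are hypothetical and not part of HBase, and this is not claimed to be the fix applied in AbstractFSWAL.shutdown; it only shows how ReentrantLock.tryLock(long, TimeUnit) lets a caller give up after a bound.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; these names do not exist in HBase.
final class TimedLockSketch {

  /**
   * Runs the action while holding the lock, but only if the lock can be
   * acquired within timeoutMillis. Returns true if the action ran.
   */
  static boolean runWithLock(ReentrantLock lock, long timeoutMillis, Runnable action) {
    boolean acquired;
    try {
      acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
    if (!acquired) {
      return false; // bound reached; the caller decides how to proceed
    }
    try {
      action.run();
    } finally {
      lock.unlock();
    }
    return true;
  }

  public static void main(String[] args) {
    ReentrantLock lock = new ReentrantLock(true); // fair lock, as in the trace
    // A daemon thread that takes the lock and never releases it stands in
    // for whatever is holding the lock during shutdown.
    Thread holder = new Thread(() -> {
      lock.lock();
      try {
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    holder.setDaemon(true);
    holder.start();
    while (!lock.isLocked()) {
      Thread.yield(); // wait until the holder owns the lock
    }
    boolean ran = runWithLock(lock, 100, () -> System.out.println("shutting down"));
    System.out.println("shutdown action ran: " + ran);
  }
}
```

Whether giving up on the lock is safe depends on what the lock protects, so the timeout value and the fallback behaviour here are assumptions rather than recommendations.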
> Add a max wait time for waitOnAllRegionsToClose
> -----------------------------------------------
>
> Key: HBASE-21325
> URL: https://issues.apache.org/jira/browse/HBASE-21325
> Project: HBase
> Issue Type: Improvement
> Reporter: Duo Zhang
> Assignee: Guanghao Zhang
> Priority: Major
>
> When testing sync replication, I found that if I transit the remote cluster
> to DA while the local cluster is still in A, the region server hangs on
> shutdown. Because the fsOk flag only tests the local cluster (which is
> reasonable), we enter waitOnAllRegionsToClose, and since the WAL is
> broken (the remote WAL directory is gone), closing the regions never
> succeeds. This leads to an infinite wait inside waitOnAllRegionsToClose.
> So I think we should have an upper bound on the wait time in the
> waitOnAllRegionsToClose method.
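
The upper bound proposed in the description can be sketched as a polling loop with a deadline. This is a minimal illustration under stated assumptions, not the HBase implementation: BoundedWaitSketch, waitUntil, maxWaitMillis and pollMillis are hypothetical names, and the BooleanSupplier stands in for a check such as "all regions are closed".

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from HBase.
final class BoundedWaitSketch {

  /**
   * Polls the condition until it returns true or maxWaitMillis elapses.
   * Returns true if the condition was met, false on timeout or interrupt.
   */
  static boolean waitUntil(BooleanSupplier done, long maxWaitMillis, long pollMillis) {
    long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
    while (!done.getAsBoolean()) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // upper bound reached; stop waiting
      }
      // Never sleep past the deadline, and sleep at least 1 ms per round.
      long sleepMillis = Math.min(pollMillis, Math.max(1L, remainingNanos / 1_000_000L));
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // A condition that never becomes true stands in for regions that cannot
    // close because the WAL is broken.
    boolean closed = waitUntil(() -> false, 200, 50);
    System.out.println("all regions closed: " + closed);
  }
}
```

With a condition that never becomes true, the call returns false once the bound has passed instead of looping forever, and the caller can then log the remaining regions and continue shutting down.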