[
https://issues.apache.org/jira/browse/HBASE-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332998#comment-16332998
]
stack commented on HBASE-19828:
-------------------------------
Ok. Interesting. Master as a RegionServer DOES NOT WORK, at least not when
running shutdown, and likely elsewhere too; acting as a RegionServer can block
Master function. AMv2 and Master-as-RegionServer are mutually exclusive: the
Region locks AMv2 takes to ensure a single writer changes Region state can
clash with the Master-acting-as-RegionServer updating its own online-region
state, so much so that it locks up shutdown (HBASE-19830).
Here is an example from a hung shutdown. First, the Master is blocked trying to
do a regionServerReport (the stack trace looks a little odd because we are
doing a short-circuit RPC). It is blocked waiting on a RegionStateNode lock:
{code:java}
"M:1;localhost:55676" #1290 daemon prio=5 os_prio=31 tid=0x00007fd48acdc000 nid=0x2380b waiting for monitor entry [0x0000700022344000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1023)
        - waiting to lock <0x000000078cf3c648> (a org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:975)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:454)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1187)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:993)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:566)
        at java.lang.Thread.run(Thread.java:745)
{code}
The RegionNode lock is held by the thread trying to do this:
{code:java}
"ProcedureExecutorWorker-5" #1846 daemon prio=5 os_prio=31 tid=0x00007fd490145000 nid=0x3430b in Object.wait() [0x0000700013c98000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.waitUntilDone(AsyncRequestFutureImpl.java:1228)
        - locked <0x000000078ec59ae0> (a java.util.concurrent.atomic.AtomicLong)
        at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.waitUntilDone(AsyncRequestFutureImpl.java:1197)
        at org.apache.hadoop.hbase.client.HTable.doBatchWithCallback(HTable.java:485)
        at org.apache.hadoop.hbase.util.MultiHConnection.processBatchCallback(MultiHConnection.java:122)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:222)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:209)
        at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:149)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1536)
        - locked <0x000000078cf3c648> (a org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode)
        at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:179)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1456)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1225)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1736){code}
But the server hosting hbase:meta has already gone away (the cluster is
shutting down at the end of the test), so we are stuck here until all retries
and timeouts are exhausted.
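The shape of the hang can be boiled down to a lock held across a slow remote call. Here is a minimal standalone sketch (not HBase code; thread and method names are hypothetical stand-ins): a "worker" takes a per-region monitor and then stalls, the way markRegionAsClosing holds the RegionStateNode lock while the meta update retries, and a "reporter" that needs the same monitor goes BLOCKED, the way regionServerReport does above:

```java
// Hypothetical sketch of the blocking pattern described above; not HBase code.
public class LockHeldAcrossRpcDemo {
    static final Object regionNodeLock = new Object(); // stands in for the RegionStateNode monitor

    /** Returns the reporter thread's state observed while the worker holds the lock. */
    static Thread.State demo() {
        Thread worker = new Thread(() -> {
            synchronized (regionNodeLock) {   // like markRegionAsClosing taking the node lock...
                try {
                    Thread.sleep(500);        // ...then stalling on the retried hbase:meta update
                } catch (InterruptedException ignored) { }
            }
        }, "worker");
        Thread reporter = new Thread(() -> {
            synchronized (regionNodeLock) { } // like regionServerReport needing the same node lock
        }, "reporter");
        try {
            worker.start();
            Thread.sleep(100);                // let the worker acquire the lock first
            reporter.start();
            Thread.sleep(100);                // give the reporter time to hit the monitor
            Thread.State observed = reporter.getState();
            worker.join();
            reporter.join();
            return observed;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The reporter shows as BLOCKED (on object monitor), matching the jstack output.
        System.out.println("reporter state while worker holds lock: " + demo());
    }
}
```

As long as the worker's "RPC" does not finish, the reporter cannot make progress; when the remote side is never coming back, that stall lasts for the full retry/timeout budget.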
At first, I thought the hung shutdown was a matter of interrupting ongoing RPCs
out to an hbase:meta that was not going to come back (this is a cluster-shutdown
hang). I implemented that, but we were still hanging in scenarios such as the
above.
In short, this test manufactures scenarios that are likely rare in production
but possible. Absolute locks, though narrow (one per RegionInfo), are
problematic. Master as RegionServer needs more thought/work. I'm going to
disable this test.
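For illustration of why "absolute" is the problematic part (this is a sketch only, not HBase code and not what HBASE-19830 does; all names are hypothetical): a bounded acquisition lets a caller such as the report path back off instead of queuing forever behind a holder that is stuck in a remote call:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: take the per-region lock with a timeout instead of absolutely.
public class TryLockSketch {
    static final ReentrantLock regionNodeLock = new ReentrantLock();

    /** Attempt the report-side work; give up if the lock stays busy. */
    static boolean tryProcessReport() {
        try {
            if (!regionNodeLock.tryLock(50, TimeUnit.MILLISECONDS)) {
                return false; // lock busy: skip this region, reconcile on the next report
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            return true;      // would reconcile the reported region state here
        } finally {
            regionNodeLock.unlock();
        }
    }

    /** Attempt the report while another "worker" thread holds the lock. */
    static boolean contendedAttempt() {
        regionNodeLock.lock(); // simulate the worker holding the lock across an RPC
        final boolean[] got = new boolean[1];
        try {
            Thread reporter = new Thread(() -> got[0] = tryProcessReport());
            reporter.start();
            reporter.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            regionNodeLock.unlock();
        }
        return got[0];
    }

    public static void main(String[] args) {
        System.out.println("uncontended attempt acquired lock: " + tryProcessReport());
        System.out.println("contended attempt acquired lock: " + contendedAttempt());
    }
}
```

The trade-off is that a timed-out caller must be safe to skip and retry later, which is not a given for state transitions that assume a single writer; that is part of why this needs more thought rather than a quick fix.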
> Flakey TestRegionsOnMasterOptions.testRegionsOnAllServers
> ---------------------------------------------------------
>
> Key: HBASE-19828
> URL: https://issues.apache.org/jira/browse/HBASE-19828
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0-beta-2
>
>
> This test is failing 50% of the time now. We seem to have made it fail more
> w/ our recent changes. The failure here is a good one, again, a real issue.
> We can get stuck trying to update meta with a Region state if the cluster is
> going down and hbase:meta has gone down before our client; we get locked-up
> retrying the put to hbase:meta for so long, the test times out:
>
> {code:java}
> Thread 2080 (ProcExecWrkr-7):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 5960
> Stack:
> java.lang.Object.wait(Native Method)
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.waitUntilDone(AsyncRequestFutureImpl.java:1228)
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.waitUntilDone(AsyncRequestFutureImpl.java:1197)
> org.apache.hadoop.hbase.client.HTable.doBatchWithCallback(HTable.java:485)
> org.apache.hadoop.hbase.util.MultiHConnection.processBatchCallback(MultiHConnection.java:122)
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:222)
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:209)
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:149)
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1536)
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:179)
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85)
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1456)
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1225)
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1736){code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)