stack created HBASE-21259:
-----------------------------
Summary: [amv2] Revived deadservers; recreated serverstatenode
Key: HBASE-21259
URL: https://issues.apache.org/jira/browse/HBASE-21259
Project: HBase
Issue Type: Bug
Components: amv2
Affects Versions: 2.1.0
Reporter: stack
Assignee: stack
Fix For: 2.2.0, 2.1.1, 2.0.3
On startup, I see servers being revived; i.e. their serverstatenode is getting
marked online even though it's just been processed by ServerCrashProcedure. It
looks like this (in a patched server that reports whenever a serverstatenode
is created):
{code}
2018-09-29 03:45:40,963 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, state=SUCCESS; ServerCrashProcedure server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, meta=false in 1.0130sec
...
2018-09-29 03:45:43,733 INFO org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! vb1442.halxg.cloudera.com,22101,1536675314426
java.lang.RuntimeException: WHERE AM I?
        at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
        at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
        at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
{code}
See how we've just finished an SCP, which will have removed the
serverstatenode... but then we come across an unassign that references the
server that was just processed. The unassign will attempt to update the
serverstatenode and, in doing so, we create one if one is not present. We
shouldn't be creating one.
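
To make the failure mode concrete, here is a minimal, self-contained sketch of
the pattern. It is not the HBase code and not the eventual patch: the class
ServerNodeSketch, its string-valued map, and the two addRegionToServer*
variants are all hypothetical stand-ins. Only the name getOrCreateServer comes
from the stack trace above. It assumes the revival comes purely from a
get-or-create lookup running after the node was removed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master's per-server state bookkeeping.
class ServerNodeSketch {
    // server name -> state; the real node type is richer than a String.
    private final Map<String, String> serverMap = new ConcurrentHashMap<>();

    // Get-or-create lookup: silently makes a fresh ONLINE node when absent.
    String getOrCreateServer(String serverName) {
        return serverMap.computeIfAbsent(serverName, k -> "ONLINE");
    }

    // Lookup-only variant: returns null when the node is gone.
    String getServerNode(String serverName) {
        return serverMap.get(serverName);
    }

    // What the end of crash processing is assumed to do: drop the node.
    void removeServer(String serverName) {
        serverMap.remove(serverName);
    }

    boolean hasServer(String serverName) {
        return serverMap.containsKey(serverName);
    }

    // Lenient path, as in the trace: a late caller revives the dead server.
    void addRegionToServerLenient(String serverName) {
        getOrCreateServer(serverName);
    }

    // Strict path (one possible alternative): refuse to recreate the node
    // and tell the caller the server is no longer known.
    boolean addRegionToServerStrict(String serverName) {
        return getServerNode(serverName) != null;
    }

    public static void main(String[] args) {
        String dead = "vb1442.halxg.cloudera.com,22101,1536675314426";
        ServerNodeSketch states = new ServerNodeSketch();

        states.getOrCreateServer(dead);   // server registers
        states.removeServer(dead);        // crash processing finishes

        // A queued unassign runs late and still references the dead server.
        System.out.println("strict accepted: " + states.addRegionToServerStrict(dead));
        System.out.println("present after strict: " + states.hasServer(dead));

        states.addRegionToServerLenient(dead);
        System.out.println("present after lenient: " + states.hasServer(dead));
    }
}
```

With the lookup-only variant the late unassign leaves the map untouched; with
the get-or-create variant the removed server reappears, which is the behaviour
reported above.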
I think I see this a lot because I am scheduling unassigns with hbck2. The
servers crash and then come up with SCPs doing cleanup of the old server while
unassign procedures are still in the procedure executor queue waiting to be
processed... but it could happen at any time on a cluster should an unassign
happen to get scheduled near an SCP.