stack created HBASE-21259:
-----------------------------

             Summary: [amv2] Revived deadservers; recreated serverstatenode
                 Key: HBASE-21259
                 URL: https://issues.apache.org/jira/browse/HBASE-21259
             Project: HBase
          Issue Type: Bug
          Components: amv2
    Affects Versions: 2.1.0
            Reporter: stack
            Assignee: stack
             Fix For: 2.2.0, 2.1.1, 2.0.3


On startup, I see dead servers being revived; i.e. their serverstatenode gets 
recreated and marked online even though it has just been cleaned up by 
ServerCrashProcedure. It looks like this (in a patched server that logs 
whenever a serverstatenode is created):

{code}
2018-09-29 03:45:40,963 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, state=SUCCESS; ServerCrashProcedure server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, meta=false in 1.0130sec
...

2018-09-29 03:45:43,733 INFO org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! vb1442.halxg.cloudera.com,22101,1536675314426
java.lang.RuntimeException: WHERE AM I?
        at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
        at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
        at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)

{code}

See how we've just finished an SCP, which will have removed the 
serverstatenode... but then we come across an unassign that references the 
server that was just processed. The unassign attempts to update the 
serverstatenode and, in doing so, creates one if none is present. We shouldn't 
be creating one for a server that SCP has already cleaned up.
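A minimal, self-contained sketch of the race (the class and method names echo the real RegionStates/ServerStateNode, but the implementation here is illustrative only, not the actual HBase code): the current path blindly recreates the node, while a guarded variant would consult the dead-server list that SCP maintains and refuse to revive.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RevivedServerSketch {
  static class ServerStateNode {
    final String serverName;
    ServerStateNode(String serverName) { this.serverName = serverName; }
  }

  // Stand-in for the RegionStates server map.
  final Map<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();
  // Stand-in for the master's dead-server list, updated by SCP.
  final Set<String> deadServers = new HashSet<>();

  // Current (buggy) behavior: creates a node if absent, even for a server
  // that an SCP has just finished processing -- the "revival".
  ServerStateNode getOrCreateServer(String serverName) {
    return serverMap.computeIfAbsent(serverName, ServerStateNode::new);
  }

  // Hypothetical guarded variant: never recreate a node for a known-dead
  // server; the caller must instead treat the region's server as crashed.
  ServerStateNode getServerIfLive(String serverName) {
    if (deadServers.contains(serverName)) {
      return null;
    }
    return serverMap.computeIfAbsent(serverName, ServerStateNode::new);
  }

  public static void main(String[] args) {
    RevivedServerSketch rs = new RevivedServerSketch();
    String crashed = "vb1442.halxg.cloudera.com,22101,1536675314426";
    // SCP finishes: node removed, server recorded as dead.
    rs.serverMap.remove(crashed);
    rs.deadServers.add(crashed);
    // A queued UnassignProcedure then touches the dead server...
    assert rs.getOrCreateServer(crashed) != null : "buggy path revives node";
    rs.serverMap.remove(crashed); // undo the revival for the comparison
    assert rs.getServerIfLive(crashed) == null : "guarded path refuses";
    System.out.println("ok");
  }
}
```

The point of the guard is that liveness is decided by the dead-server bookkeeping, not by whether a map entry happens to exist at the moment the unassign runs.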

I think I see this a lot because I am scheduling unassigns with hbck2. The 
servers crash and then come back up while SCPs are still cleaning up the old 
server instances and unassign procedures are still queued in the procedure 
executor waiting to be processed.... but this could happen at any time on a 
cluster, should an unassign happen to get scheduled near an SCP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
