[ 
https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642633#comment-16642633
 ] 

stack commented on HBASE-21259:
-------------------------------

Let me put up patch .002. It plugs the holes mentioned above. For the moment, 
it still has the debug that [~allan163] pointed to in his review of this patch. 
With this new patch we skip the SCP if there is no mention of the crashed 
server in meta, fs, or the dead servers list. It looks like this:

{code}

2018-10-08 16:39:29,616 DEBUG 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Stored pid=961713, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494
2018-10-08 16:39:29,616 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=961711, 
state=SUCCESS; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=566a0a0a93cb4705bf36dc84db7a7205, 
server=ve0815.halxg.cloudera.com,22101,1538114138702 in 108msec
2018-10-08 16:39:29,616 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for 
pid=961713, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494
2018-10-08 16:39:29,637 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=961713 updating 
hbase:meta row=79b5367f6579126707c231efdac9fd63, regionState=CLOSING, 
regionLocation=vd0803.halxg.cloudera.com,22101,1538185201494
2018-10-08 16:39:29,637 WARN org.apache.hadoop.hbase.ipc.RpcServer: 
(responseTooSlow): 
{"call":"Unassigns(org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$UnassignsRequest)","starttimems":1539041944370,"responsesize":1024,"method":"Unassigns","param":"TODO:
 class 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$UnassignsRequest","processingtimems":25246,"client":"10.17.208.17:42428","queuetimems":0,"class":"HMaster"}
2018-10-08 16:39:29,638 INFO 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch 
pid=961713, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
UnassignProcedure table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494
2018-10-08 16:39:29,638 WARN 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote 
call failed rit=CLOSING, 
location=vd0803.halxg.cloudera.com,22101,1538185201494; pid=961713, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
vd0803.halxg.cloudera.com,22101,1538185201494; pid=961713, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494
        at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
        at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:276)
        at 
org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:206)
        at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
        at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
        at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1727)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1495)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2033)
2018-10-08 16:39:29,639 WARN 
org.apache.hadoop.hbase.master.assignment.UnassignProcedure: Expiring 
vd0803.halxg.cloudera.com,22101,1538185201494, pid=961713, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure 
table=IntegrationTestBigLinkedList_20180620132845, 
region=79b5367f6579126707c231efdac9fd63, 
server=vd0803.halxg.cloudera.com,22101,1538185201494 rit=CLOSING, 
location=vd0803.halxg.cloudera.com,22101,1538185201494; 
exception=NoServerDispatchException
2018-10-08 16:39:29,643 INFO org.apache.hadoop.hbase.master.ServerManager: 
Skipping expire; vd0803.halxg.cloudera.com,22101,1538185201494 is not online, 
not in deadservers, not in fs -- presuming long gone server instance!


{code}

This patch makes it so hbck2 can unassign regions stuck in CLOSING. Previously, 
calling unassign on a region whose server was long gone would result in an SCP 
that found no logs and, because the server was not carrying any regions, failed 
to clean up the hung RPC dispatch that made the CLOSE call to the non-existent 
server. Skipping the expire saves an SCP that does nothing and avoids this hung 
RPC/STUCK procedure.
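
To make the skip condition concrete, here is a minimal sketch of the decision 
as the "Skipping expire" log line above describes it. It is not the code in the 
patch: the class, the method name, and the three sets are hypothetical 
stand-ins for the master's online servers, its dead servers list, and the 
servers that still have something on the filesystem.

```java
import java.util.Collections;
import java.util.Set;

public class ExpireDecisionSketch {
    /**
     * Hypothetical helper: returns true if it is worth scheduling an SCP for
     * the named server, false if the server looks like a long-gone instance.
     * All three sets are assumptions made for illustration only.
     */
    static boolean shouldScheduleScp(String serverName,
                                     Set<String> online,
                                     Set<String> deadServers,
                                     Set<String> onFilesystem) {
        // Any mention at all means there may still be state to clean up.
        return online.contains(serverName)
            || deadServers.contains(serverName)
            || onFilesystem.contains(serverName);
    }

    public static void main(String[] args) {
        String gone = "vd0803.halxg.cloudera.com,22101,1538185201494";
        Set<String> none = Collections.emptySet();

        // No mention anywhere: skip the expire, no SCP is scheduled.
        System.out.println(shouldScheduleScp(gone, none, none, none));

        // Still listed as dead: an SCP is scheduled as before.
        System.out.println(
            shouldScheduleScp(gone, none, Collections.singleton(gone), none));
    }
}
```

The point of the three-way check is that a single remaining mention is enough 
to keep the old behaviour; only a server with no trace anywhere is presumed 
long gone.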

> [amv2] Revived deadservers; recreated serverstatenode
> -----------------------------------------------------
>
>                 Key: HBASE-21259
>                 URL: https://issues.apache.org/jira/browse/HBASE-21259
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.2.0, 2.1.1, 2.0.3
>
>         Attachments: HBASE-21259.branch-2.1.001.patch, 
> HBASE-21259.branch-2.1.002.patch
>
>
> On startup, I see servers being revived; i.e. their serverstatenode is 
> getting marked online even though it's just been processed by 
> ServerCrashProcedure. It looks like this (in a patched server that reports 
> whenever a serverstatenode is created):
> {code}
> 2018-09-29 03:45:40,963 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, 
> state=SUCCESS; ServerCrashProcedure 
> server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, 
> meta=false in 1.0130sec
> ...
> 2018-09-29 03:45:43,733 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! 
> vb1442.halxg.cloudera.com,22101,1536675314426
> java.lang.RuntimeException: WHERE AM I?
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
>         at 
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
> {code}
> See how we've just finished an SCP, which will have removed the 
> serverstatenode... but then we come across an unassign that references the 
> server that was just processed. The unassign will attempt to update the 
> serverstatenode and, in doing so, creates one if none is present. We 
> shouldn't be creating one.
> I think I see this a lot because I am scheduling unassigns with hbck2. The 
> servers crash and then come up with SCPs doing cleanup of the old server 
> while unassign procedures are still in the procedure executor queue waiting 
> to be processed... but it could happen at any time on a cluster should an 
> unassign happen to get scheduled near an SCP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
