[ 
https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20796:
--------------------------
    Assignee: stack
      Status: Patch Available  (was: Open)

.001

    On assign, handleFailure can be called more than once: first
    from the SCP and then later, if the RPC is stuck on the
    crashed server, as part of the RPC cleanup. Add checks so we
    drop the second call on the floor rather than 'process' it and
    'wake' a Procedure that has already handled the failure
    (if the Procedure is already done, the logs fill with notices
    of a STUCK 'zombie' Procedure). UnassignProcedure, though it
    takes a different form, already had protection against
    running failure handling twice; here we add the same
    protection to AssignProcedure.

    M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
     Have handleFailure return true if it 'processed' the failure, else
     false if the call comes in and regionNode is not in expected state;
     if the latter, just ignore the call.

     Bug fix! We were calling offline of regionNode BEFORE we asked
     am to undoRegionAsOpening. This looks like it should be AFTER
     the am has done its cleanup; in fact, let am set the regionNode
     to OFFLINE as it has a lock on regionNode already.

     Only set Procedure to go back to the start if we 'processed'
     the failure; else just exit after logging we're ignoring
     the invocation.
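
     To illustrate the guard, here is a minimal, self-contained sketch.
     It is not the patch: RegionNode, AssignSketch, undoOpening and the
     restart counter are hypothetical stand-ins, and the real classes
     carry far more state. It only shows the shape of the check: the
     first failure is 'processed', the late one is ignored.

```java
public class DoubleFailureSketch {
  enum State { OFFLINE, OPENING, OPEN }

  /** Hypothetical stand-in for a region state node; holds only a state. */
  static final class RegionNode {
    private State state = State.OFFLINE;

    synchronized State getState() { return state; }
    synchronized void setState(State s) { state = s; }

    /** Undo OPENING by going OFFLINE; returns true only if the node was OPENING. */
    synchronized boolean undoOpening() {
      if (state != State.OPENING) {
        return false;
      }
      state = State.OFFLINE;
      return true;
    }
  }

  /** Hypothetical stand-in for the assign procedure's failure handling. */
  static final class AssignSketch {
    private final RegionNode node;
    private int restarts = 0;

    AssignSketch(RegionNode node) { this.node = node; }

    int getRestarts() { return restarts; }

    /** Returns true if the failure was processed, false if it was ignored. */
    boolean handleFailure(String cause) {
      if (!node.undoOpening()) {
        // Node is not in the expected state: a late or duplicate call. Drop it.
        System.out.println("Ignoring failure (" + cause + "); node is " + node.getState());
        return false;
      }
      // Only a processed failure sends the procedure back to its start.
      restarts++;
      return true;
    }
  }

  public static void main(String[] args) {
    RegionNode node = new RegionNode();
    node.setState(State.OPENING);
    AssignSketch proc = new AssignSketch(node);

    // First call, e.g. from the server crash handling: processed.
    boolean first = proc.handleFailure("ServerCrashProcedure");
    // The retry succeeds elsewhere and the region becomes OPEN.
    node.setState(State.OPEN);
    // Second call, e.g. the stuck RPC finally failing: ignored.
    boolean second = proc.handleFailure("Connection reset by peer");

    System.out.println("first=" + first + " second=" + second
        + " restarts=" + proc.getRestarts() + " state=" + node.getState());
  }
}
```

     In the sketch the state check and the state change happen under
     one lock, so two racing callers cannot both 'process' the failure.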

    M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
     Return true if we ran an undo of OPENING state.
     Added setting node into OFFLINE state in here.
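
     A rough sketch of why the ordering matters, under assumptions:
     ManagerSketch, markOpening, regionsByServer and the cleanup shown
     are hypothetical, and the sketch assumes the undo only does its
     cleanup when the node is still OPENING. If the caller sets the
     node OFFLINE first, the undo finds nothing to do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UndoOpeningSketch {
  enum State { OFFLINE, OPENING, OPEN }

  /** Hypothetical region node: a name, a state and a location. */
  static final class RegionNode {
    final String name;
    State state = State.OFFLINE;
    String location;

    RegionNode(String name) { this.name = name; }
  }

  /** Hypothetical manager that tracks which regions each server is opening. */
  static final class ManagerSketch {
    final Map<String, Set<String>> regionsByServer = new HashMap<>();

    void markOpening(RegionNode node, String server) {
      synchronized (node) {
        node.state = State.OPENING;
        node.location = server;
        regionsByServer.computeIfAbsent(server, s -> new HashSet<>()).add(node.name);
      }
    }

    /** Returns true if an undo of OPENING ran; the node is set OFFLINE in here. */
    boolean undoRegionAsOpening(RegionNode node) {
      synchronized (node) {
        if (node.state != State.OPENING) {
          return false;
        }
        Set<String> regions = regionsByServer.get(node.location);
        if (regions != null) {
          regions.remove(node.name);
        }
        // Go OFFLINE only after cleanup, while the node lock is still held.
        node.state = State.OFFLINE;
        node.location = null;
        return true;
      }
    }
  }

  public static void main(String[] args) {
    ManagerSketch am = new ManagerSketch();

    // Old order: the caller sets OFFLINE first, so the undo sees no OPENING state.
    RegionNode a = new RegionNode("region-a");
    am.markOpening(a, "server-1");
    a.state = State.OFFLINE;
    boolean undoneOld = am.undoRegionAsOpening(a);
    boolean stillListed = am.regionsByServer.get("server-1").contains("region-a");
    System.out.println("old order: undone=" + undoneOld + " stillListed=" + stillListed);

    // New order: the manager does the cleanup and then sets OFFLINE itself.
    RegionNode b = new RegionNode("region-b");
    am.markOpening(b, "server-1");
    boolean undoneNew = am.undoRegionAsOpening(b);
    System.out.println("new order: undone=" + undoneNew + " state=" + b.state);
  }
}
```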

    M hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestAssignProcedure.java
     This test seemed to be in the wrong package, so moved it. Then added
     a test that we don't run handleFailure twice.

> STUCK RIT though region successfully assigned
> ---------------------------------------------
>
>                 Key: HBASE-20796
>                 URL: https://issues.apache.org/jira/browse/HBASE-20796
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.1.0
>
>         Attachments: HBASE-20796.branch-2.0.001.patch
>
>
> This is a good one. We keep logging messages like this:
> {code}
> 2018-06-26 12:32:24,859 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition rit=OPENING, 
> location=vd0410.X.Y.com,22101,1529611445046, 
> table=IntegrationTestBigLinkedList_20180525080406, 
> region=e10b35d49528e2453a04c7038e3393d7
> {code}
> ...though the region is successfully assigned.
> Story:
>  * Dispatch an assign 2018-06-26 12:31:27,390 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch 
> pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, 
> location=vd0410.X.Y.Z,22101,1529611445046
>  * It gets stuck 2018-06-26 12:32:29,860 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046, 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
>  * We stay STUCK for a while.
>  * The Master notices the server as crashed and starts a SCP.
>  * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105 
> found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, 
> location=vd0410.X.Y.Z,22101,1529611445046
>  * The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote 
> call failed pid=370829, ppid=370391, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, 
> location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure 
> pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
>  * Which eventually succeeds..... Successfully deployed to new server 
> 2018-06-26 12:32:55,429 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829, 
> ppid=370391, state=SUCCESS; AssignProcedure 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
>  * But then, it looks like the RPC was still ongoing, and it broke in the following way: 
> 2018-06-26 12:33:06,378 WARN 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote 
> call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, 
> location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to 
> vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception: 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> syscall:read(..) failed: Connection reset by peer (Notice how state for 
> region is OPEN and 'SUCCESS').
>  * Then says 2018-06-26 12:33:06,380 INFO 
> org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10; 
> pid=370829, ppid=370391, state=SUCCESS; AssignProcedure 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, 
> location=vc0614.X.Y.Z,22101,1529611443424
>  * And finally...  2018-06-26 12:34:10,727 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition rit=OFFLINE, location=null, 
> table=IntegrationTestBigLinkedList_20180612114844, 
> region=f69ccf7d9178ce166b515e0e2ef019d2
> Restart of Master got rid of the STUCK complaints.
> This is interesting because the stuck rpc and the successful reassign are all 
> riding on the same pid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
