[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

Ankit Singhal (JIRA) Tue, 13 Nov 2018 11:20:19 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685628#comment-16685628
 ]


Ankit Singhal commented on HBASE-21440:
---------------------------------------

bq. Legit failures then in your opinion sir? Related?
Actually, the test failures do not seem to be related (and the code path in 
the patch is not even exercised by these tests).

TestMasterFailoverWithProcedures is flaky; I can see it has failed multiple 
times in the nightly builds.
{code}
https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1065/
https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1064/
https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1068/
{code}

So I asked [[email protected]] to help verify TestMasterFailoverWithProcedures 
with my patch, and he also found that it passes locally. Thanks, Ted.

TestMergeTableRegionsProcedure fails every time, at least in branch-2.0/2.1. 
{code}
https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2.0/1904/#showFailuresLink
{code}

[~stack], WDYT, can this be committed now?


> Assign procedure on the crashed server is not properly interrupted
> ------------------------------------------------------------------
>
>                 Key: HBASE-21440
>                 URL: https://issues.apache.org/jira/browse/HBASE-21440
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21440.branch-2.0.001.patch, 
> HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch, 
> HBASE-21440.branch-2.0.004.patch
>
>
> When a server crashes, its SCP checks whether there is already a procedure 
> assigning a region on this crashed server. If it finds one, the SCP just 
> interrupts the already running AssignProcedure by calling remoteCallFailed, 
> which internally just changes the region node state to OFFLINE and sends the 
> procedure back with the transition queue state for assignment with a new plan. 
> However, because of a race between the call to remoteCallFailed and the 
> current state of the already running assign procedure 
> (REGION_TRANSITION_FINISH, where the region is already opened), the assign 
> procedure may go ahead and update the regionStateNode to OPEN on a crashed 
> server. 
> Because the SCP had already skipped this region for assignment, relying on 
> the existing assign procedure to do the right thing, the region ends up in 
> an inaccessible state.
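
The interleaving described above can be replayed in a small, single-threaded toy model. This is only a sketch under stated assumptions: `RegionNode`, `finishUnguarded`, `finishGuarded`, and `replay` are hypothetical names invented for illustration, not HBase classes or methods, and the guarded variant is just one conceivable way to avoid the bad outcome, not necessarily what the attached patch does.

```java
import java.util.HashSet;
import java.util.Set;

public class AssignRaceSketch {
    enum State { OFFLINE, OPENING, OPEN }

    /** Toy stand-in for a region state node (hypothetical, not the HBase class). */
    static final class RegionNode {
        State state = State.OPENING;
        final String server;
        RegionNode(String server) { this.server = server; }
    }

    /** Models the SCP interrupting an in-flight assign: the node goes OFFLINE. */
    static void remoteCallFailed(RegionNode node) {
        node.state = State.OFFLINE;
    }

    /** Finish step with no check: marks the region OPEN unconditionally. */
    static void finishUnguarded(RegionNode node) {
        node.state = State.OPEN;
    }

    /** Hypothetical guard: refuse to mark OPEN if the target server has crashed. */
    static boolean finishGuarded(RegionNode node, Set<String> crashed) {
        if (crashed.contains(node.server)) {
            return false; // leave the node OFFLINE so it can be reassigned
        }
        node.state = State.OPEN;
        return true;
    }

    /** Replays the problematic ordering and returns the final region state. */
    static State replay(boolean guarded) {
        Set<String> crashed = new HashSet<>();
        // 1. The assign procedure is at its finish step for a region on rs1.
        RegionNode node = new RegionNode("rs1");
        // 2. rs1 crashes; the SCP finds the in-flight assign and interrupts it.
        crashed.add("rs1");
        remoteCallFailed(node);
        // 3. The assign procedure's finish step still runs afterwards.
        if (guarded) {
            finishGuarded(node, crashed);
        } else {
            finishUnguarded(node);
        }
        return node.state;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + replay(false)); // prints "unguarded: OPEN"
        System.out.println("guarded: " + replay(true));    // prints "guarded: OFFLINE"
    }
}
```

In the unguarded replay the region ends up OPEN on a server that no longer exists, and since the SCP skipped it, nothing reassigns it. In the guarded replay the region stays OFFLINE and remains eligible for a new assignment plan.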



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
