[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686060#comment-16686060 ]
Allan Yang edited comment on HBASE-21440 at 11/14/18 4:28 AM:
--------------------------------------------------------------

+1 on the v4 patch (please fix the checkstyle issues). Maybe it is HBASE-21468 that is making these tests flaky. I have committed HBASE-21468 to branch-2.0 and branch-2.1. [~an...@apache.org], you can re-trigger a QA run here. Sorry for that.

was (Author: allan163):
+1 on the v4 patch. Maybe it is HBASE-21468 that is making these tests flaky. I have committed HBASE-21468 to branch-2.0 and branch-2.1. [~an...@apache.org], you can re-trigger a QA run here. Sorry for that.

> Assign procedure on the crashed server is not properly interrupted
> ------------------------------------------------------------------
>
>                 Key: HBASE-21440
>                 URL: https://issues.apache.org/jira/browse/HBASE-21440
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>         Attachments: HBASE-21440.branch-2.0.001.patch,
> HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch,
> HBASE-21440.branch-2.0.004.patch
>
>
> When a server crashes, its SCP checks whether there is already a procedure
> assigning the region on this crashed server. If it finds one, SCP just
> interrupts the already running AssignProcedure by calling remoteCallFailed,
> which internally just changes the region node state to OFFLINE and sends the
> procedure back with the transition queue state for assignment with a new plan.
> But, due to the race between the call to remoteCallFailed and the current
> state of the already running assign procedure (REGION_TRANSITION_FINISH,
> where the region is already opened), it is possible that the assign procedure
> goes ahead and updates the regionStateNode to OPEN on a crashed server.
> As SCP has already skipped this region for assignment, because it was relying
> on the existing assign procedure to do the right thing, this confusion leaves
> the region in an inaccessible state.
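
The race described in the issue can be modelled roughly as follows. This is only a simplified sketch: none of the types or methods below are HBase classes, the names (serverCrashed, finishAssignUnguarded, finishAssignGuarded) are hypothetical, and the guarded variant is one possible way to close the window, not necessarily what the attached patches do.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the race; these are NOT HBase classes.
final class AssignRaceSketch {
    enum State { OFFLINE, OPENING, OPEN }

    // Stand-in for a region state node: current state plus target server.
    static final class RegionNode {
        State state = State.OPENING;
        final String server;
        RegionNode(String server) { this.server = server; }
    }

    static final class Master {
        final Set<String> crashedServers = new HashSet<>();

        // Models the crash-handling path: record the crash and push the
        // region back to OFFLINE, relying on the in-flight assign to retry.
        synchronized void serverCrashed(String server, RegionNode node) {
            crashedServers.add(server);
            if (server.equals(node.server)) {
                node.state = State.OFFLINE;
            }
        }

        // Models the finish step without any check: the region is marked
        // OPEN even if its target server has already been declared crashed.
        synchronized void finishAssignUnguarded(RegionNode node) {
            node.state = State.OPEN;
        }

        // Assumed guard: refuse to mark the region OPEN on a crashed server
        // and leave it OFFLINE so that it can be assigned again.
        synchronized boolean finishAssignGuarded(RegionNode node) {
            if (crashedServers.contains(node.server)) {
                node.state = State.OFFLINE;
                return false;
            }
            node.state = State.OPEN;
            return true;
        }
    }
}
```

With the interleaving "crash handled first, finish step second", the unguarded path leaves the node OPEN on a dead server, while the guarded path keeps it OFFLINE and reports that the assignment has to be retried.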