[ 
https://issues.apache.org/jira/browse/HBASE-24117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103026#comment-17103026
 ] 

Michael Stack commented on HBASE-24117:
---------------------------------------

Uploaded a log showing the described failure. I got it by making the change 
below and then running the test in a loop. It took a long time to 
reproduce... so long that I'm unscheduling this JIRA.

{code}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestCloseRegionWhileRSCrash.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestCloseRegionWhileRSCrash.java
index 4761991e75..da4f36ea20 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestCloseRegionWhileRSCrash.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestCloseRegionWhileRSCrash.java
@@ -163,7 +163,8 @@ public class TestCloseRegionWhileRSCrash {
     UTIL.shutdownMiniCluster();
   }

-  @org.junit.Ignore @Test // Until root-cause of flakeyness, HBASE-24117, is addressed.
+  @Test
+  // @org.junit.Ignore @Test // Until root-cause of flakeyness, HBASE-24117, is addressed.
   public void testRetryBackoff() throws IOException, InterruptedException {
     HRegionServer srcRs = UTIL.getRSForFirstRegionInTable(TABLE_NAME);
     RegionInfo region = srcRs.getRegions(TABLE_NAME).get(0).getRegionInfo();
{code}

> If move target RS crashes, move fails if concurrent master crash
> ----------------------------------------------------------------
>
>                 Key: HBASE-24117
>                 URL: https://issues.apache.org/jira/browse/HBASE-24117
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>         Attachments: 
> org.apache.hadoop.hbase.master.assignment.TestCloseRegionWhileRSCrash-output.txt
>
>
> I saw this on TestCloseRegionWhileRSCrash. The Region 
> 788a516d1f86af98e0a16bcc1afe4fa1 was being moved to RS 
> example.com,62652,1586032098445 just after that RS was killed. The Move 
> Close fails because the RS has no node in the Master. The Move then tries 
> to 'confirm' the close, but that fails too because there is no remote RS. 
> We then wait in this state until an operator or some other procedure 
> intervenes to 'fix' the state. Normally a ServerCrashProcedure would do the 
> job, but in this test the Master is restarted after the RS is killed, a 
> condition we do not accommodate.
> Let me attach the test log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
