[jira] [Commented] (HBASE-18366) Fix flaky test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta

Umesh Agashe (JIRA) Tue, 11 Jul 2017 22:48:53 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083428#comment-16083428
 ]


Umesh Agashe commented on HBASE-18366:
--------------------------------------

Thanks [~stack], [~yangzhe1991]!

I think it's a timing issue, as I have seen it pass too! But for me it fails 
far more often than it passes. I am still debugging it. From what I see:
The TableNotFoundException is for table 
'testRecoveryAndDoubleExecution-carryingMeta-true'. This table is created by 
the test, and the exception is thrown in util.countRows() when the table is 
scanned, in the following code snippet:

{code}
      // Now run through the procedure twice crashing the executor on each step...
      MasterProcedureTestingUtility.testRecoveryAndDoubleExecution(procExec, procId);
      // Assert all data came back.
      assertEquals(count, util.countRows(t));
{code}

Here is the exception:
{code}
org.apache.hadoop.hbase.TableNotFoundException: testRecoveryAndDoubleExecution-carryingMeta-true
  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:845)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:745)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:720)
  at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:316)
  at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:139)
  at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
  at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:104)
  at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}

At this time I am not sure how the changes for HBASE-17931 affect the test, 
but after reverting them locally I ran the test 4-5 times and it passed every 
time. If the meta region is being transitioned while the scan is going on, we 
can see this exception, but I still have to confirm that's the case here.

AssignmentManager.checkIfShouldMoveSystemRegionAsync() is called during 
active master initialization and from RegionServerTracker.refresh(), and 
moveAsync() is used to submit the procedure. This could explain the timing 
issue. If I cannot get to the bottom of this by tomorrow, I will disable the 
test and continue working on it.
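
To make the suspected interleaving concrete, here is a toy model in plain Java. 
It is not HBase code and it is not a confirmed diagnosis: ToyMeta, LookupFailed 
and the row count of 42 are all hypothetical stand-ins. The sketch only assumes 
what is described above, namely that a move is submitted asynchronously and that 
a lookup issued while the move is in flight can fail. Latches force the bad 
ordering so the example is deterministic; in the real test the ordering would 
depend on timing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model (not HBase code) of a lookup racing with an asynchronous move. */
class MetaMoveRace {

    /** Hypothetical stand-in for TableNotFoundException. */
    static class LookupFailed extends RuntimeException {
        LookupFailed(String table) { super(table); }
    }

    /** Hypothetical stand-in for the meta region: offline between close and reopen. */
    static class ToyMeta {
        private final AtomicBoolean online = new AtomicBoolean(true);
        final CountDownLatch offline = new CountDownLatch(1);
        final CountDownLatch mayReopen = new CountDownLatch(1);

        /** Returns immediately; the move itself runs on another thread. */
        CompletableFuture<Void> moveAsync() {
            return CompletableFuture.runAsync(() -> {
                online.set(false);          // region closed on the old server
                offline.countDown();
                try {
                    mayReopen.await();      // time spent in transition
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                online.set(true);           // region opened on the new server
            });
        }

        /** Fails while the region is in transition, otherwise returns a fixed count. */
        int countRows(String table) {
            if (!online.get()) {
                throw new LookupFailed(table);
            }
            return 42;
        }
    }

    /** Returns {1 if the lookup failed during the move else 0, rows counted after the move}. */
    static int[] run() throws Exception {
        ToyMeta meta = new ToyMeta();
        CompletableFuture<Void> move = meta.moveAsync();
        meta.offline.await();               // force the lookup to land mid-move
        int failed = 0;
        try {
            meta.countRows("t");
        } catch (LookupFailed e) {
            failed = 1;
        }
        meta.mayReopen.countDown();         // let the move finish
        move.get();                         // wait for completion before retrying
        return new int[] { failed, meta.countRows("t") };
    }

    public static void main(String[] args) throws Exception {
        int[] r = run();
        System.out.println("failed during move: " + (r[0] == 1)
            + ", rows after move: " + r[1]);
    }
}
```

In this model the same lookup fails while the move is in flight and succeeds 
once the move has completed, which matches a test that passes on some runs and 
fails on others.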


> Fix flaky test 
> hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18366
>                 URL: https://issues.apache.org/jira/browse/HBASE-18366
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>
> It worked for a few days after being enabled with HBASE-18278, but started 
> failing after these commits:
> 6786b2b
> 68436c9
> 75d2eca
> 50bb045
> df93c13
> It works with one commit before: c5abb6c. Need to see what changed with those 
> commits.
> Currently it fails with TableNotFoundException.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
