elek opened a new pull request #993: HDDS-1709. TestScmSafeNode is flaky
URL: https://github.com/apache/hadoop/pull/993
 
 
   org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode is failed at last 
night with the following error:
   {code:java}
   java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at 
org.junit.Assert.assertTrue(Assert.java:41) at 
org.junit.Assert.assertTrue(Assert.java:52) at 
org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74){code}
   Locally it can be tested but it's very easy to reproduce by adding an 
additional sleep DataNodeSafeModeRule:
   {code:java}
   +++ 
b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
   @@ -63,7 +63,11 @@ protected boolean validate() {
    
      @Override
      protected void process(NodeRegistrationContainerReport reportsProto) {
   -
   +    try {
   +      Thread.sleep(3000);
   +    } catch (InterruptedException e) {
   +      e.printStackTrace();
   +    }{code}
   This is a clear race condition:
   
   DatanodeSafeModeRule and ContainerSafeModeRule are processing the same 
events but it can be possible (in case of an accidental sleep) that the 
container safe mode rule is done, but DatanodeSafeModeRule didn't process the 
new event (yet).
   
   As a result the test execution will continue:
   {code:java}
   GenericTestUtils
       .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000);
   {code}
   (This line is waiting ONLY for the ContainerSafeModeRule).
   
   The fix is easy, let's wait for the processing of all the async events:
   {code:java}
   EventQueue eventQueue =
       (EventQueue) cluster.getStorageContainerManager().getEventQueue();
   eventQueue.processAll(5000L);{code}
   As we are sure that the events are already sent to the EventQueue (because 
we have the previous waitFor), it should be enough.
   
   See: https://issues.apache.org/jira/browse/HDDS-1709

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to