Re: Review Request 64523: Attempt #2 to fix flaky Webhook test

Bill Farner Mon, 11 Dec 2017 20:54:40 -0800


> On Dec. 11, 2017, 6:57 p.m., Bill Farner wrote:
> > I should have looked more closely at the last patch, and when preparing my 
> > patch before that.  This test is bound to be flaky since there's nothing 
> > ensuring a happens-before releationship between issuing the request and 
> > checking the result.
> > 
> > I can repro consistently with the patch below, slowing down the 
> > `onThrowable` handler:
> > ```console
> > $ git diff
> > diff --git a/src/main/java/org/apache/aurora/scheduler/events/Webhook.java 
> > b/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
> > index 95e3f52..d4519ee 100644
> > --- a/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
> > +++ b/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
> > @@ -102,6 +102,10 @@ public class Webhook extends AbstractIdleService 
> > implements EventSubscriber {
> >          createRequest(stateChange).execute(new 
> > AsyncCompletionHandler<Integer>() {
> >            @Override
> >            public void onThrowable(Throwable t) {
> > +            try {
> > +              Thread.sleep(1000);
> > +            } catch (InterruptedException e) {
> > +            }
> >              errorsCounter.incrementAndGet();
> >              LOG.error("Error sending a Webhook event", t);
> > ```
> > 
> > If i get some spare time, i may take a stab at a fix; but the 
> > happens-before is what we'll want to seek.
> 
> Jordan Ly wrote:
>     Interesting -- I would imagine this client would force the future to 
> complete.
>     
>     ```
>       /**
>        * Wrap the DefaultAsyncHttpClient for testing so we can complete 
> futures synchronously before
>        * validating assertions. Otherwise, we would have to call 
> `Thread.sleep` in our tests after
>        * each TaskStateChange.
>        */
>       static class WebhookAsyncHttpClientWrapper extends 
> DefaultAsyncHttpClient {
>     
>         WebhookAsyncHttpClientWrapper(DefaultAsyncHttpClientConfig config) {
>           super(config);
>         }
>     
>         @Override
>         public <T> ListenableFuture<T> 
> executeRequest(org.asynchttpclient.Request request,
>                                                       AsyncHandler<T> 
> handler) {
>           ListenableFuture<T> future = super.executeRequest(request, handler);
>           try {
>             future.get();
>             future.done();
>           } catch (InterruptedException | ExecutionException e) {
>             // Ignore, future should not fail to resolve.
>           }
>           return future;
>         }
>       }
>     ```
>     
>     Must be something flawed in the implementation.
> 
> Bill Farner wrote:
>     I re-remembered this code after posting the above reply; and i 
> agree...pretty surprising!


This patch fixes things for me; causing the `Thread.sleep()` to no longer 
affect test results:

```console
diff --git a/src/test/java/org/apache/aurora/scheduler/events/WebhookTest.java 
b/src/test/java/org/apache/aurora/scheduler/events/WebhookTest.java
index 1b5d2d0..cdcf50a 100644
--- a/src/test/java/org/apache/aurora/scheduler/events/WebhookTest.java
+++ b/src/test/java/org/apache/aurora/scheduler/events/WebhookTest.java
@@ -17,7 +17,6 @@ import java.io.IOException;
 import java.net.URI;
 import java.util.List;
 import java.util.Map;
-import java.util.concurrent.ExecutionException;
 import java.util.stream.Collectors;

 import javax.servlet.ServletException;
@@ -107,12 +106,7 @@ public class WebhookTest {
     public <T> ListenableFuture<T> executeRequest(org.asynchttpclient.Request 
request,
                                                   AsyncHandler<T> handler) {
       ListenableFuture<T> future = super.executeRequest(request, handler);
-      try {
-        future.get();
-        future.done();
-      } catch (InterruptedException | ExecutionException e) {
-        // Ignore, future should not fail to resolve.
-      }
+      future.toCompletableFuture().join();
       return future;
     }
   }
```

The `future.get()` alone does not work due to this code in 
`NettyResponseFuture`:
```java
    public final void abort(final Throwable t) {

        if (terminateAndExit())
            return;

        future.completeExceptionally(t);

        if (onThrowableCalledField.compareAndSet(this, 0, 1)) {
            try {
                asyncHandler.onThrowable(t);
            } catch (Throwable te) {
                LOGGER.debug("asyncHandler.onThrowable", te);
            }
        }
    }
```

The `future.completeExceptionally(t)` call causes `future.get()` above to throw 
`ExecutionException`, and our main test thread continues.  Netty's thread 
separately proceeds to call `asyncHandler.onThrowable(t)`, which races with the 
main thread.

I have not looked into the internals of `CompletableFuture#join()` to say why 
it differs, but it does pass consistently for me.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64523/#review193491
-----------------------------------------------------------


On Dec. 11, 2017, 6:29 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64523/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2017, 6:29 p.m.)
> 
> 
> Review request for Aurora and Stephan Erb.
> 
> 
> Bugs: AURORA-1961
>     https://issues.apache.org/jira/browse/AURORA-1961
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Attempt #2 to fix the flaky Webhook test.
> 
> Along the same lines of Stephan's change 
> (https://reviews.apache.org/r/64482/), but opting to never start the Jetty 
> server and forcing a connect exception.
> 
> Overall, this flakiness has seemed to increase in volume over the past month. 
> I've been running into it with a noticable fraction of internal/test builds.
> 
> 
> Diffs
> -----
> 
>   src/test/java/org/apache/aurora/scheduler/events/WebhookTest.java 
> 1b5d2d02535345edfe6cf04d18d00434393f800b 
> 
> 
> Diff: https://reviews.apache.org/r/64523/diff/1/
> 
> 
> Testing
> -------
> 
> This change seems pretty hard to test considering the differences in 
> environment and the unknown cause of the actual errors. Maybe we run the 
> reviewbot on this repo repeatedly? Obviously not the most scientifically 
> sound solution.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>

Re: Review Request 64523: Attempt #2 to fix flaky Webhook test

Reply via email to