GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/11763

    [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to 
fix failure on slow machines

    ## What changes were proposed in this pull request?
    
    I'm seeing several PR builder builds fail after 
https://github.com/apache/spark/pull/11725/files. Example:
    
    
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console
    
    ```
    testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time 
elapsed: 0.023 sec  <<< FAILURE!
    java.lang.AssertionError: expected:<app-id> but was:<null>
        at 
org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
    ```
    
    However, other builds pass this same test, including the test when run 
locally and on the Jenkins PR builder. The failure itself concerns a change to 
how the test waits on a condition, and the wait can time out; therefore I think 
this is due to fast/slow machine differences.
    
    This is an attempt at a hot fix; it's a little hard to verify since locally 
and on the PR builder, it passes anyway. The change itself should be harmless 
anyway.
    
    Why didn't this happen before, if the new logic was supposed to be 
equivalent to the old? I think this is the sequence:
    
    - First attempt to acquire semaphore for 10ms actually silently times out
    - The changed being waited for happens just after that, a bit too late
    - Assertion passes since condition became true just in time
    - `release()` fires from the listener
    - Next `tryAcquire` however immediately succeeds because the first 
`tryAcquire` didn't acquire anything, but its subsequent condition is not yet 
true; this would explain why the second one always fails
    
    Versus the original using `notifyAll()`, there's a small difference: 
`wait()`-ing after `notifyAll()` just results in another wait; it doesn't make 
it return immediately. So this was a tiny latent issue that was masked by the 
semantics. Now the test asserts that the event actually happened (semaphore was 
acquired). (The timeout is still here to prevent the test from hanging forever, 
and to detect really slow response.) The timeout is increased to a second to 
allow plenty of time anyway.
    
    ## How was this patch tested?
    
    Jenkins tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-13823.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11763
    
----
commit dac72054fcf95eadd0a54210e2df9efc9ac74507
Author: Sean Owen <[email protected]>
Date:   2016-03-16T13:53:48Z

    Increase tryAcquire timeout and assert it succeeds to fix failure on slow 
machines

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to