GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/20743

    [SPARK-23020][CORE][branch-2.3] Fix another race in the in-process launcher 
test.

    First the bad news: there's an unfixable race in the launcher code.
    (By unfixable I mean it would take a lot more effort than this change
    to fix it.) The good news is that it should only affect super short
    lived applications, such as the one run by the flaky test, so it's
    possible to work around it in our test.
    
    The fix also uncovered an issue with the recently added "closeAndWait()"
    method; closing the connection would still possibly cause data loss,
    so this change waits a while for the connection to finish itself, and
    closes the socket if that times out. The existing connection timeout
    is reused so that if desired it's possible to control how long to wait.
    
    As part of that I also restored the old behavior that disconnect() would
    force a disconnection from the child app; the "wait for data to arrive"
    approach is only taken when disposing of the handle.
    
    I tested this by inserting a bunch of sleeps in the test and the socket
    handling code in the launcher library; with those I was able to reproduce
    the error from the jenkins jobs. With the changes, even with all the
    sleeps still in place, all tests pass.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20743.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20743
    
----
commit 5f3f4a621961f87488600c46d0cc55081e4c6ba6
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-02-02T03:43:22Z

    [SPARK-23020][CORE] Fix another race in the in-process launcher test.
    
    First the bad news: there's an unfixable race in the launcher code.
    (By unfixable I mean it would take a lot more effort than this change
    to fix it.) The good news is that it should only affect super short
    lived applications, such as the one run by the flaky test, so it's
    possible to work around it in our test.
    
    The fix also uncovered an issue with the recently added "closeAndWait()"
    method; closing the connection would still possibly cause data loss,
    so this change waits a while for the connection to finish itself, and
    closes the socket if that times out. The existing connection timeout
    is reused so that if desired it's possible to control how long to wait.
    
    As part of that I also restored the old behavior that disconnect() would
    force a disconnection from the child app; the "wait for data to arrive"
    approach is only taken when disposing of the handle.
    
    I tested this by inserting a bunch of sleeps in the test and the socket
    handling code in the launcher library; with those I was able to reproduce
    the error from the jenkins jobs. With the changes, even with all the
    sleeps still in place, all tests pass.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #20462 from vanzin/SPARK-23020.
    
    (cherry picked from commit 969eda4a02faa7ca6cf3aff5cd10e6d51026b845)

commit 06aa292c15e61170b91f622dbce54a8149c99991
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-03-06T00:28:52Z

    Re-enable test.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to