GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/20743
[SPARK-23020][CORE][branch-2.3] Fix another race in the in-process launcher
test.
First the bad news: there's an unfixable race in the launcher code.
(By unfixable I mean it would take a lot more effort than this change
to fix it.) The good news is that it should only affect super short
lived applications, such as the one run by the flaky test, so it's
possible to work around it in our test.
The fix also uncovered an issue with the recently added "closeAndWait()"
method; closing the connection would still possibly cause data loss,
so this change waits a while for the connection to finish itself, and
closes the socket if that times out. The existing connection timeout
is reused so that if desired it's possible to control how long to wait.
As part of that I also restored the old behavior that disconnect() would
force a disconnection from the child app; the "wait for data to arrive"
approach is only taken when disposing of the handle.
I tested this by inserting a bunch of sleeps in the test and the socket
handling code in the launcher library; with those I was able to reproduce
the error from the jenkins jobs. With the changes, even with all the
sleeps still in place, all tests pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-23020
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20743.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20743
----
commit 5f3f4a621961f87488600c46d0cc55081e4c6ba6
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-02-02T03:43:22Z
[SPARK-23020][CORE] Fix another race in the in-process launcher test.
First the bad news: there's an unfixable race in the launcher code.
(By unfixable I mean it would take a lot more effort than this change
to fix it.) The good news is that it should only affect super short
lived applications, such as the one run by the flaky test, so it's
possible to work around it in our test.
The fix also uncovered an issue with the recently added "closeAndWait()"
method; closing the connection would still possibly cause data loss,
so this change waits a while for the connection to finish itself, and
closes the socket if that times out. The existing connection timeout
is reused so that if desired it's possible to control how long to wait.
As part of that I also restored the old behavior that disconnect() would
force a disconnection from the child app; the "wait for data to arrive"
approach is only taken when disposing of the handle.
I tested this by inserting a bunch of sleeps in the test and the socket
handling code in the launcher library; with those I was able to reproduce
the error from the jenkins jobs. With the changes, even with all the
sleeps still in place, all tests pass.
Author: Marcelo Vanzin <[email protected]>
Closes #20462 from vanzin/SPARK-23020.
(cherry picked from commit 969eda4a02faa7ca6cf3aff5cd10e6d51026b845)
commit 06aa292c15e61170b91f622dbce54a8149c99991
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-03-06T00:28:52Z
Re-enable test.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]