GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/20388
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again.
Third time is the charm?
There was still a race that was left in previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that would prevent the thread from waiting on the connection thread to
finish. That could cause the race causing the test flakiness reported
in the bug.
The fix is to move the "wait for connection thread" code to a separate
close method that is used by the handle; that also simplifies the code
a bit and makes it also easier to follow.
I included an unrelated, but correct, change to a YARN test so that
it triggers when the PR is built.
Tested by inserting a sleep in the connection thread to mimic the race;
test failed reliably with the sleep, passes now. (Sleep not included in
the patch.) Also ran YARN tests to make sure.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-23020
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20388.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20388
----
commit fb14eaa918509f124a2a75155f8199b28de9a183
Author: Marcelo Vanzin <vanzin@...>
Date: 2018-01-24T21:48:52Z
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again.
Third time is the charm?
There was still a race that was left in previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that would prevent the thread from waiting on the connection thread to
finish. That could cause the race causing the test flakiness reported
in the bug.
The fix is to move the "wait for connection thread" code to a separate
close method that is used by the handle; that also simplifies the code
a bit and makes it also easier to follow.
I included an unrelated, but correct, change to a YARN test so that
it triggers when the PR is built.
Tested by inserting a sleep in the connection thread to mimic the race;
test failed reliably with the sleep, passes now. (Sleep not included in
the patch.) Also ran YARN tests to make sure.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]