This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 56c65d58729d [SPARK-23015][WINDOWS] Fix bug in Windows where starting
multiple Spark instances within the same second causes a failure
56c65d58729d is described below
commit 56c65d58729dba4abf9dd039a37e05bcb7e79526
Author: Zachary Steudel <[email protected]>
AuthorDate: Wed May 29 11:08:35 2024 +0900
[SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark
instances within the same second causes a failure
### What changes were proposed in this pull request?
**Problem**
If you start multiple Spark instances within the same second, there is a
high likelihood that the temporary spark-class-launcher-output file will have
the same name for several of them, causing the Spark launcher to fail. The
error looks something like this:
```
WARNING - The process cannot access the file because it is being used by
another process.
WARNING - The system cannot find the file
C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
WARNING - The process cannot access the file because it is being used by
another process.
```
Windows' %RANDOM% is seeded with 1-second granularity. We start ~20 instances
at the same time on Windows every day and encounter this bug on a weekly
basis.
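The collision can be illustrated with a small Python analogy. This is only a sketch: it assumes a generator seeded from the clock at 1-second resolution, as described above, and does not reproduce cmd.exe's actual algorithm. The function name `launcher_suffix` and the timestamp are hypothetical.

```python
import random


def launcher_suffix(start_time_seconds: int) -> int:
    """Mimic a suffix drawn from a generator seeded with the start time in whole seconds."""
    rng = random.Random(start_time_seconds)
    # %RANDOM% yields an integer between 0 and 32767.
    return rng.randrange(32768)


# Twenty launchers that all start within the same (hypothetical) second.
same_second = [launcher_suffix(1_700_000_000) for _ in range(20)]

# Identical seeds give identical suffixes, so every launcher picks the same file name.
print(len(set(same_second)))  # prints 1
```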
**Proposed Fix**
Instead of relying on %RANDOM%, which has poor granularity, use PowerShell
to generate a GUID and append it to the end of the temp file name. We have
been using this in production for around 2-3 months and have not encountered
this bug since.
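The same idea can be sketched in Python, purely as an analogy to the PowerShell call used in the patch. `uuid.uuid4()` stands in for `[guid]::NewGuid()`, and the helper name `launcher_output_path` is hypothetical. A random GUID does not depend on the clock, so launchers that start in the same second still get different file names.

```python
import os
import tempfile
import uuid


def launcher_output_path() -> str:
    """Build a temp file path whose suffix is a random GUID instead of a time-seeded number."""
    suffix = uuid.uuid4()
    return os.path.join(
        tempfile.gettempdir(), f"spark-class-launcher-output-{suffix}.txt"
    )


# Twenty "simultaneous" launchers: a set keeps only distinct paths.
paths = {launcher_output_path() for _ in range(20)}
print(len(paths))
```

A collision between random GUIDs is possible in principle but improbable enough to ignore for temp file naming.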
### Why are the changes needed?
My team runs Spark on Windows and we boot up 20+ instances within a few
seconds on a daily basis. We encountered this bug weekly and have taken steps
to mitigate it without changing the Spark source code, such as adding a random
sleep of 1-300 seconds before starting Spark. Even with a random sleep, some
of the 20+ instances are likely to sleep a similar amount of time and start
at the same time. Also, relying on a random sleep before starting
Spark is clunky, unreliable, and not a [...]
Eventually our team went ahead and edited the code in this .cmd file with
this fix. I figured I should make a pull request for this as well.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
You can pretty reliably recreate this bug by submitting 30 Spark jobs in
Windows using spark-submit. Eventually the Spark launcher will overlap with
another Spark launcher and fail.
You can pull my fixed spark-class2.cmd and try this again; the bug should no
longer occur.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #43706 from Rafnel/patch-1.
Authored-by: Zachary Steudel <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
bin/spark-class2.cmd | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index 800ec0c02c22..8703f5a86f10 100755
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -61,14 +61,24 @@ if not "x%JAVA_HOME%"=="x" (
)
)
+rem SPARK-23015: We create a temporary text file when launching Spark.
+rem This file must be given a unique name or else we risk a race condition when launching multiple instances close together.
+rem The best way to create a unique file name is to add a GUID to the file name. Use Powershell to generate the GUID.
+where powershell.exe >nul 2>&1
+if %errorlevel%==0 (
+ FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (set RANDOM_SUFFIX=%%a)
+) else (
+ rem If Powershell is not installed, try to create a random file name suffix using the Windows %RANDOM%.
+ rem %RANDOM% is seeded with 1-second granularity so it is highly likely that two Spark instances
+ rem launched within the same second will fail to start.
+ rem Note that Powershell is automatically installed on all Windows OS from Windows 7/Windows Server 2008 R2 and onward.
+ set RANDOM_SUFFIX=%RANDOM%
+)
+
rem The launcher library prints the command to be executed in a single line suitable for being
rem executed by the batch interpreter. So read all the output of the launcher into a variable.
-:gen
-set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
-rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call,
-rem so we should make it sure to generate unique file to avoid process collision of writing into
-rem the same file concurrently.
-if exist %LAUNCHER_OUTPUT% goto :gen
+set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM_SUFFIX%.txt
+
rem unset SHELL to indicate non-bash environment to launcher/Main
set SHELL=
"%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%