This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 56c65d58729d [SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure
56c65d58729d is described below

commit 56c65d58729dba4abf9dd039a37e05bcb7e79526
Author: Zachary Steudel <[email protected]>
AuthorDate: Wed May 29 11:08:35 2024 +0900

    [SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure
    
    ### What changes were proposed in this pull request?
    **Problem**
    If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output temp file will have the same name for multiple instances, causing the Spark launcher to fail. The error looks something like this:
    
    ```
    WARNING - The process cannot access the file because it is being used by another process.
    WARNING - The system cannot find the file C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
    WARNING - The process cannot access the file because it is being used by another process.
    ```
    
    Windows' %RANDOM% is seeded with 1-second granularity. We start ~20 instances at the same time daily on Windows and encounter this bug on a weekly basis.
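
The same-second collision can be modeled in a few lines of Python. This is only a sketch under an assumption: `suffix_like_cmd_random` is a hypothetical stand-in that seeds a PRNG with the whole-second timestamp, which mimics the 1-second seeding described above but is not cmd.exe's actual algorithm. The timestamps are made-up example values.

```python
import random


def suffix_like_cmd_random(start_time: float) -> int:
    """Hypothetical model of %RANDOM%: a PRNG seeded with whole seconds only."""
    rng = random.Random(int(start_time))  # the sub-second part is discarded
    return rng.randint(0, 32767)          # %RANDOM% yields values in 0..32767


def temp_name(suffix) -> str:
    """Build the launcher output file name used by spark-class2.cmd."""
    return f"spark-class-launcher-output-{suffix}.txt"


# Two launchers that start 300 ms apart, inside the same wall-clock second.
first = temp_name(suffix_like_cmd_random(1716948515.10))
second = temp_name(suffix_like_cmd_random(1716948515.40))

print(first == second)  # True: both launchers would write to the same file
```

In this model, any two start times that truncate to the same second produce the same file name, which is the situation behind the "being used by another process" warnings.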
    
    **Proposed Fix**
    Instead of relying on %RANDOM%, which has poor granularity, use PowerShell to generate a GUID and append it to the end of the temp file name. We have been using this in production for around 2-3 months and have not encountered this bug since.
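
As a rough illustration of why a GUID suffix avoids the problem, the sketch below builds file names the same way but uses Python's `uuid.uuid4()` as a stand-in for PowerShell's `[guid]::NewGuid()`. The helper name `launcher_output_path` is hypothetical, and the patch itself does this in batch, not Python.

```python
import os
import tempfile
import uuid


def launcher_output_path() -> str:
    """Return a temp file path with a random GUID suffix (hypothetical helper)."""
    suffix = str(uuid.uuid4())  # stand-in for PowerShell's [guid]::NewGuid()
    return os.path.join(
        tempfile.gettempdir(), f"spark-class-launcher-output-{suffix}.txt"
    )


# Simulate 30 launchers starting at effectively the same instant.
paths = {launcher_output_path() for _ in range(30)}
print(len(paths))  # expected to be 30: a GUID does not depend on the start time
```

Because the suffix no longer depends on the clock, launchers started in the same second still get distinct file names.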
    
    ### Why are the changes needed?
    My team runs Spark on Windows and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and have taken steps to mitigate it without changing the Spark source code, such as adding a random sleep of 1-300 seconds before starting Spark. Even with a random sleep, with 20+ instances it is likely that some sleep for a similar amount of time and start at the same time. Also, relying on a random sleep before starting Spark is clunky, unreliable, and not a  [...]
    
    Eventually our team went ahead and edited the code in this .cmd file with 
this fix. I figured I should make a pull request for this as well.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    You can pretty reliably recreate this bug by submitting 30 Spark jobs on Windows using spark-submit. Eventually one Spark launcher will overlap with another and fail.
    
    You can pull my fixed spark-class2.cmd and try this again; the bug should no longer occur.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #43706 from Rafnel/patch-1.
    
    Authored-by: Zachary Steudel <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 bin/spark-class2.cmd | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index 800ec0c02c22..8703f5a86f10 100755
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -61,14 +61,24 @@ if not "x%JAVA_HOME%"=="x" (
   )
 )
 
+rem SPARK-23015: We create a temporary text file when launching Spark. 
+rem This file must be given a unique name or else we risk a race condition when launching multiple instances close together.
+rem The best way to create a unique file name is to add a GUID to the file name. Use Powershell to generate the GUID.
+where powershell.exe >nul 2>&1
+if %errorlevel%==0 (
+  FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (set RANDOM_SUFFIX=%%a)
+) else (
+  rem If Powershell is not installed, try to create a random file name suffix using the Windows %RANDOM%.
+  rem %RANDOM% is seeded with 1-second granularity so it is highly likely that two Spark instances
+  rem launched within the same second will fail to start.
+  rem Note that Powershell is automatically installed on all Windows OS from Windows 7/Windows Server 2008 R2 and onward.
+  set RANDOM_SUFFIX=%RANDOM%
+)
+
 rem The launcher library prints the command to be executed in a single line suitable for being
 rem executed by the batch interpreter. So read all the output of the launcher into a variable.
-:gen
-set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
-rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call,
-rem so we should make it sure to generate unique file to avoid process collision of writing into
-rem the same file concurrently.
-if exist %LAUNCHER_OUTPUT% goto :gen
+set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM_SUFFIX%.txt
+
 rem unset SHELL to indicate non-bash environment to launcher/Main
 set SHELL=
 "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%

