attilapiros opened a new pull request, #48683:
URL: https://github.com/apache/spark/pull/48683

   
   ### What changes were proposed in this pull request?
   
   Fixing race conditions in the test "SPARK-46957: Migrated shuffle files should be able to cleanup from executor" of `BlockManagerDecommissionIntegrationSuite`.
   
   There are at least two race conditions in this test:
   
   1) The `SparkListener` runs in a different thread from the main thread of the unit test, so `shuffleBlockUpdates` must be accessed in a thread-safe way. For this a `ConcurrentLinkedQueue` is used, similarly to `taskEndEvents` in the `runDecomTest()` method (see the sketch after the stack trace below). No error has been reported for this one yet.
   
   2) The `SparkListener` is informed about a removed executor before the block manager's `stop()` is called/finished. So later, when the test recursively iterates over the files to find the shuffle files, the block manager's `stop()` might still be running and deleting the underlying files recursively via the disk block manager. This leads to a `java.nio.file.NoSuchFileException` like:
   
   ```
   - SPARK-46957: Migrated shuffle files should be able to cleanup from executor *** FAILED ***
   18848  java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /home/runner/work/spark/spark/core/target/tmp/spark-87f59bc6-b996-42cd-9775-2f704b67f773/executor-e0a030d4-434b-46b3-bdbf-81d4908bb0f5/blockmgr-d88ed5dd-c1c8-4713-9433-10694e736a8e/3a
   18849  at java.base/java.nio.file.FileTreeIterator.fetchNextIfNeeded(FileTreeIterator.java:87)
   18850  at java.base/java.nio.file.FileTreeIterator.hasNext(FileTreeIterator.java:103)
   18851  at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
   18852  at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939)
   18853  at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
   18854  at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
   18855  at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
   18856  at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   18857  at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
   18858  at org.apache.commons.io.FileUtils.toList(FileUtils.java:3025)
   18859  ...
   18860  Cause: java.nio.file.NoSuchFileException: /home/runner/work/spark/spark/core/target/tmp/spark-87f59bc6-b996-42cd-9775-2f704b67f773/executor-e0a030d4-434b-46b3-bdbf-81d4908bb0f5/blockmgr-d88ed5dd-c1c8-4713-9433-10694e736a8e/3a
   18861  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
   18862  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
   18863  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
   18864  at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
   18865  at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:171)
   18866  at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
   18867  at java.base/java.nio.file.Files.readAttributes(Files.java:1853)
   18868  at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:226)
   18869  at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:277)
   18870  at java.base/java.nio.file.FileTreeWalker.next(FileTreeWalker.java:374)
   18871  ...
   ```
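
   For fix 1), here is a minimal sketch of the listener side, assuming a hypothetical `shuffleBlockUpdates` queue collecting `ShuffleDataBlockId`s (not the exact code of this PR):

   ```scala
   import java.util.concurrent.ConcurrentLinkedQueue

   import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}
   import org.apache.spark.storage.ShuffleDataBlockId

   // Listener callbacks run on the listener bus thread, not the test's main
   // thread, so the updates are collected into a thread-safe queue instead of
   // a plain mutable buffer.
   val shuffleBlockUpdates = new ConcurrentLinkedQueue[ShuffleDataBlockId]()

   val listener = new SparkListener {
     override def onBlockUpdated(update: SparkListenerBlockUpdated): Unit = {
       update.blockUpdatedInfo.blockId match {
         case blockId: ShuffleDataBlockId => shuffleBlockUpdates.add(blockId)
         case _ => // ignore non-shuffle block updates
       }
     }
   }
   ```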
   
   To address this issue, the local directory used by the decommissioned executor is looked up, and the wait is changed from waiting for the `ExecutorRemoved` event to waiting for the deletion of this directory to finish (a sketch of this wait follows below). This way the directory tree is walked only after the delete has completed.
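
   A minimal sketch of that wait, assuming ScalaTest's `eventually` and hypothetical `decomExecLocalDir`/`waitForExecutorDirCleanup` names (not the exact code of this PR):

   ```scala
   import java.io.File

   import org.scalatest.concurrent.Eventually._
   import org.scalatest.time.SpanSugar._

   // `decomExecLocalDir` is assumed to be the decommissioned executor's local
   // directory, captured before the executor is removed.
   def waitForExecutorDirCleanup(decomExecLocalDir: File): Unit = {
     eventually(timeout(1.minute), interval(100.milliseconds)) {
       // Walking the directory tree is only safe once the block manager's
       // stop() has finished deleting this directory recursively.
       assert(!decomExecLocalDir.exists())
     }
   }
   ```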
   
   Disclaimer:
   
   The JIRA also mentions failures like:
   ```
   [info] - SPARK-46957: Migrated shuffle files should be able to cleanup from executor *** FAILED *** (35 seconds, 200 milliseconds)
   15718[info]   0 was not greater than or equal to 4 (BlockManagerDecommissionIntegrationSuite.scala:423)
   15719[info]   org.scalatest.exceptions.TestFailedException:
   15720[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   15721[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   15722[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
   15723[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
   15724[info]   at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$10(BlockManagerDecommissionIntegrationSuite.scala:423)
   ```
   
   In my reproduction only the failure with `java.nio.file.NoSuchFileException` came up, so there might still be something to fix even after these changes.
   
   ### Why are the changes needed?
   
   The Maven daily test with Java 21 was failing from time to time (it was flaky because of these race conditions).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   It was tested by running the existing unit test repeatedly.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.

