[ https://issues.apache.org/jira/browse/SPARK-45579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank Yin updated SPARK-45579:
------------------------------

Description:

During Spark executor decommission, fallback storage uploads can fail due to a race condition, even though the file's existence is checked right before the copy:

```
java.io.FileNotFoundException: No file: /var/data/spark-ab14b716-630d-435e-a92a-1403f6206dd8/blockmgr-7f9ab4d7-1340-4b39-9558-fde994a82090/0b/shuffle_175_66754_0.index
	at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource(CopyFromLocalOperation.java:314)
	at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:167)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$26(S3AFileSystem.java:3854)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2480)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2499)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFromLocalFile(S3AFileSystem.java:3847)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520)
	at org.apache.spark.storage.FallbackStorage.copy(FallbackStorage.scala:67)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12(BlockManagerDecommissioner.scala:146)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12$adapted(BlockManagerDecommissioner.scala:146)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:146)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

This failure blocks the executor from exiting properly, because the decommissioner never considers shuffle migration complete.
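The failure pattern reads like a check-then-act race: the existence check and the `copyFromLocalFile` call are not atomic, so another actor (for example, shuffle file cleanup) can delete the file in between, and S3A's own re-check inside `CopyFromLocalOperation.checkSource` then throws. Below is a minimal, self-contained Scala sketch of the suspected pattern and one possible mitigation; the names (`FallbackCopySketch`, `uploadToFallback`, `migrateIndexFile`) are hypothetical illustrations, not the actual Spark code paths, and the mitigation shown is an assumption about a fix, not a committed change.

```scala
import java.io.{File, FileNotFoundException}

// Hypothetical sketch of the suspected race; the real logic lives in
// org.apache.spark.storage.FallbackStorage.copy (FallbackStorage.scala:67).
object FallbackCopySketch {

  // Stand-in for FileSystem.copyFromLocalFile: the S3A implementation
  // re-checks the source file and throws FileNotFoundException if it is gone.
  def uploadToFallback(src: File): Unit = {
    if (!src.exists()) {
      throw new FileNotFoundException(s"No file: ${src.getPath}")
    }
    // ... actual upload would happen here ...
  }

  def migrateIndexFile(indexFile: File): Unit = {
    // 1. The caller checks that the file exists ...
    if (indexFile.exists()) {
      // 2. ... but the file can be deleted right here, between the
      //    check and the copy ...
      // 3. ... so the copy still fails with FileNotFoundException.
      uploadToFallback(indexFile)
    }
  }

  // One possible mitigation (an assumption, not the committed fix): treat a
  // source file that vanished between check and copy as already cleaned up,
  // so the block can still be marked migrated and the executor can exit.
  def migrateIndexFileTolerant(indexFile: File): Unit = {
    try {
      uploadToFallback(indexFile)
    } catch {
      case _: FileNotFoundException =>
        () // File disappeared mid-migration; skip rather than fail forever.
    }
  }
}
```

Under this reading, any fix needs either to tolerate a vanished source file (as sketched above) or to otherwise mark such blocks as handled, so the decommissioner's completion check can succeed and the executor can exit.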
was:
During Spark executor decommission, the fallback storage uploads can fail due to some race conditions:


> Executor hangs indefinitely due to decommissioner errors
> --------------------------------------------------------
>
>                 Key: SPARK-45579
>                 URL: https://issues.apache.org/jira/browse/SPARK-45579
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Frank Yin
>            Priority: Major