Laurens created SPARK-34680:
-------------------------------

             Summary: Spark hangs when out of disk space
                 Key: SPARK-34680
                 URL: https://issues.apache.org/jira/browse/SPARK-34680
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.1.1, 3.0.1
         Environment: Running Spark and PySpark 3.1.1 with Hadoop 3.2.2 and Koalas 1.6.0.
Some environment variables:
|Java Home|/usr/lib/jvm/java-11-openjdk-11.0.3.7-0.el7_6.x86_64|
|Java Version|11.0.3 (Oracle Corporation)|
|Scala Version|version 2.12.10|

            Reporter: Laurens

While running a workflow that uses Koalas, I noticed a stage had already been hanging for 8 hours. I checked the logs, and the last output is:

{code:java}
21/03/09 13:50:31 ERROR TaskMemoryManager: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@4127a515
java.io.IOException: No space left on device
	at java.base/java.io.FileOutputStream.writeBytes(Native Method)
	at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
	at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
	at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
	at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
	at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:223)
	at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:176)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:260)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:218)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:276)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:208)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:289)
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:116)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:385)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:409)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:249)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:178)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
	Suppressed: java.io.IOException: No space left on device
		at java.base/java.io.FileOutputStream.writeBytes(Native Method)
		at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
		at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
		at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
		at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
		at net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:243)
		at org.apache.spark.serializer.DummySerializerInstance$1.flush(DummySerializerInstance.java:50)
		at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
		at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$1(DiskBlockObjectWriter.scala:156)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
		at org.apache.spark.storage.DiskBlockObjectWriter.close(DiskBlockObjectWriter.scala:158)
		at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:226)
		... 18 more
	Suppressed: java.io.IOException: No space left on device
		at java.base/java.io.FileOutputStream.writeBytes(Native Method)
		at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
		at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
		at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
		at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
		at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:182)
		at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.org$apache$spark$storage$DiskBlockObjectWriter$ManualCloseOutputStream$$super$close(DiskBlockObjectWriter.scala:108)
		at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose(DiskBlockObjectWriter.scala:65)
		at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose$(DiskBlockObjectWriter.scala:64)
		at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.manualClose(DiskBlockObjectWriter.scala:108)
		at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$closeResources$1(DiskBlockObjectWriter.scala:135)
		at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
		at org.apache.spark.storage.DiskBlockObjectWriter.closeResources(DiskBlockObjectWriter.scala:136)
		at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$2(DiskBlockObjectWriter.scala:158)
		at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
		... 20 more
21/03/09 13:50:31 INFO TaskMemoryManager: Memory used in task 1255
21/03/09 13:50:31 INFO TaskMemoryManager: Acquired by HybridRowQueue(org.apache.spark.memory.TaskMemoryManager@394bad48,/local/anonymized/spark/spark-4b70492b-8f2e-4108-b6a0-6ed423a98bd9/executor-b88a6782-4592-45c0-a484-73a2f642cb3e/spark-c20b49eb-83d4-4145-b07a-fe6fddef7ffe,7,org.apache.spark.serializer.SerializerManager@59dd92e8): 105.5 MiB
21/03/09 13:50:31 INFO TaskMemoryManager: Acquired by org.apache.spark.shuffle.sort.ShuffleExternalSorter@4127a515: 14.4 GiB
21/03/09 13:50:31 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@34a4b163: 15.1 GiB
21/03/09 13:50:31 INFO TaskMemoryManager: 67108864 bytes of memory were used by task 1255 but are not associated with specific consumers
21/03/09 13:50:31 INFO TaskMemoryManager: 31853114929 bytes of memory are used for execution and 526799 bytes of memory are used for storage
{code}

The local time is now 21/03/09 21:33:00, so the worker appears to be stuck: the stage makes no progress, but it also never fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
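[Editor's note, not part of the original report] Since the spill that fails here writes to whatever {{spark.local.dir}} points at, one hedged workaround while the hang itself is unfixed is a pre-flight free-space check on the driver, so the job fails fast instead of wedging mid-stage. The function below is a hypothetical sketch: the directory list and the threshold are illustrative values you would replace with your actual local dirs and expected spill size.

{code:python}
import shutil

def check_spill_space(local_dirs, min_free_gib=32):
    """Return (dir, free_gib) pairs for every directory whose filesystem
    has less than min_free_gib GiB free.

    Both parameters are illustrative: pass the real value of
    spark.local.dir (split on commas) and a threshold sized to the
    shuffle spill you expect, e.g. the ~30 GiB seen in the log above.
    """
    too_small = []
    for d in local_dirs:
        # shutil.disk_usage reports (total, used, free) in bytes.
        free_gib = shutil.disk_usage(d).free / (1024 ** 3)
        if free_gib < min_free_gib:
            too_small.append((d, free_gib))
    return too_small

# Hypothetical usage before submitting the job:
#     low = check_spill_space("/tmp/spark-local".split(","), min_free_gib=64)
#     if low:
#         raise RuntimeError(f"insufficient spill space: {low}")
{code}

This only avoids the symptom; the underlying bug, a task that blocks instead of failing after spill() throws {{IOException}}, still needs a fix in Spark itself.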