Dmitry Goldenberg created SPARK-44346:
-----------------------------------------
Summary: Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate
Key: SPARK-44346
URL: https://issues.apache.org/jira/browse/SPARK-44346
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.3.2
Environment: AWS EMR emr-6.11.0
Spark 3.3.2
pandas 1.3.5
pyarrow 12.0.0
"spark.sql.shuffle.partitions": "210",
"spark.default.parallelism": "210",
"spark.yarn.stagingDir": "hdfs:///tmp",
"spark.sql.adaptive.enabled": "true",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.dynamicAllocation.enabled": "false",
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
Reporter: Dmitry Goldenberg
I am seeing the exception below, logged as a WARN. Apparently, a Python worker crashes.

There are multiple issues here:
- What is the cause of the crash? Is it something to do with pyarrow, e.g. some kind of versioning mismatch?
- Error handling in Spark: the error is too low-level to make sense of. Can it be caught in Spark and dealt with properly?
- The cluster doesn't recover or cleanly terminate; it essentially just hangs. EMR doesn't terminate it either.
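To help rule out the versioning-mismatch theory, here is a minimal, self-contained sketch of a version check (assuming the PySpark 3.3 documented minimums of pandas >= 1.0.5 and pyarrow >= 1.0.0; the `MINIMUMS` values and helper names are my own, not from Spark). It could be run both on the driver and inside a task (e.g. via a trivial UDF) to catch an executor-side environment that differs from the driver's:

```python
# Hypothetical diagnostic: compare installed pandas/pyarrow versions against
# the minimums PySpark 3.3 documents. Run on the driver AND inside a Spark
# task to detect executor-side mismatches.
from importlib.metadata import version

# Assumed minimums for PySpark 3.3 (verify against the official docs).
MINIMUMS = {"pandas": (1, 0, 5), "pyarrow": (1, 0, 0)}

def parse(v):
    # Keep only the leading numeric components: "12.0.0" -> (12, 0, 0).
    parts = []
    for p in v.split("."):
        if p.isdigit():
            parts.append(int(p))
        else:
            break
    return tuple(parts)

def check():
    # Returns {package: (installed_version_or_None, meets_minimum)}.
    report = {}
    for pkg, minimum in MINIMUMS.items():
        try:
            v = version(pkg)
            report[pkg] = (v, parse(v) >= minimum)
        except Exception:
            report[pkg] = (None, False)
    return report
```

In our environment (pandas 1.3.5, pyarrow 12.0.0) both checks pass on the driver, so if a mismatch is the cause it would have to be on the executors.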
Stack traces:
```
23/07/05 22:43:47 WARN TaskSetManager: Lost task 1.0 in stage 81.0 (TID 2761) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50)
    at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748)
    ... 26 more
23/07/05 22:43:47 INFO TaskSetManager: Starting task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com, executor 2, partition 1, NODE_LOCAL, 5020 bytes) taskResourceAssignments Map()
23/07/05 23:30:17 INFO TaskSetManager: Finished task 2.0 in stage 81.0 (TID 2762) in 8603522 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (1/3)
23/07/05 23:39:09 INFO TaskSetManager: Finished task 0.0 in stage 81.0 (TID 2760) in 9135125 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (2/3)
23/07/06 00:04:41 WARN TaskSetManager: Lost task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50)
    at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748)
    ... 26 more
```
--
This message was sent by Atlassian Jira (v8.20.10#820010)