[
https://issues.apache.org/jira/browse/SPARK-35401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345609#comment-17345609
]
Hyukjin Kwon commented on SPARK-35401:
--------------------------------------
[~nafarmer] mind elabourating your enviornment?
> java.io.StreamCorruptedException: invalid stream header: 204356EC when using
> toPandas() with PySpark
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-35401
> URL: https://issues.apache.org/jira/browse/SPARK-35401
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 3.1.1
> Reporter: Nathan Farmer
> Priority: Minor
>
> Whenever I try to read a Spark dataset using PySpark and convert it to a
> Pandas df for modeling I get the error: `java.io.StreamCorruptedException:
> invalid stream header: 204356EC` on the toPandas() step.Whenever I try to
> read a Spark dataset using PySpark and convert it to a Pandas df for modeling
> I get the error: `java.io.StreamCorruptedException: invalid stream header:
> 204356EC` on the toPandas() step.
> I am not a Java coder (hence PySpark) and so these errors can be pretty
> cryptic to me. I tried the following things, but I still have this issue:
> * Made sure my Spark and PySpark versions matched as suggested here:
> [java.io.StreamCorruptedException when importing a CSV to a Spark
> DataFrame|https://stackoverflow.com/questions/53286071/java-io-streamcorruptedexception-when-importing-a-csv-to-a-spark-dataframe]
> * Reinstalled Spark using the methods suggested here: [Complete Guide to
> Installing PySpark on
> MacOS|https://kevinvecmanis.io/python/pyspark/install/2019/05/31/Installing-Apache-Spark.html]
> The logging in the test script below verifies the Spark and PySpark versions
> are aligned.
> test.py:
> {code:python}
> import logging
> from pyspark.sql import SparkSessionfrom pyspark import SparkContext
> import findsparkfindspark.init()
> logging.basicConfig( format='%(asctime)s %(levelname)-8s %(message)s',
> level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')
> sc = SparkContext('local[*]', 'test')spark =
> SparkSession(sc)logging.info('Spark location:
> {}'.format(findspark.find()))logging.info('PySpark version:
> {}'.format(spark.sparkContext.version))
> logging.info('Reading spark input dataframe')test_df =
> spark.read.csv('./data', header=True, sep='|', inferSchema=True)
> logging.info('Converting spark DF to pandas DF')pandas_df =
> test_df.toPandas()logging.info('DF record count:
> {}'.format(len(pandas_df)))sc.stop()
> {code}
> Output:
> {code:java}
> $ python ./test.py 21/05/13 11:54:32 WARN NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicableUsing Spark's default log4j profile:
> org/apache/spark/log4j-defaults.propertiesSetting default log level to
> "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).2021-05-13 11:54:34 INFO Spark location:
> /Users/username/server/spark-3.1.1-bin-hadoop2.72021-05-13 11:54:34 INFO
> PySpark version: 3.1.12021-05-13 11:54:34 INFO Reading spark input
> dataframe2021-05-13 11:54:42 INFO Converting spark DF to pandas DF
> 21/05/13 11:54:42 WARN package: Truncated the string
> representation of a plan since it was too large. This behavior can be
> adjusted by setting 'spark.sql.debug.maxToStringFields'.21/05/13 11:54:45
> ERROR TaskResultGetter: Exception while getting task
> result12]java.io.StreamCorruptedException: invalid stream header: 204356EC at
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936) at
> java.io.ObjectInputStream.<init>(ObjectInputStream.java:394) at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
> at
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:97)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996) at
> org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)Traceback (most recent call last):
> File "./test.py", line 23, in <module> pandas_df = test_df.toPandas()
> File
> "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/pandas/conversion.py",
> line 141, in toPandas pdf = pd.DataFrame.from_records(self.collect(),
> columns=self.columns) File
> "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py",
> line 677, in collect sock_info = self._jdf.collectToPython() File
> "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1304, in __call__ File
> "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 111, in deco return f(*a, **kw) File
> "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
> line 326, in get_return_valuepy4j.protocol.Py4JJavaError: An error occurred
> while calling o31.collectToPython.: org.apache.spark.SparkException: Job
> aborted due to stage failure: Exception while getting task result:
> java.io.StreamCorruptedException: invalid stream header: 204356EC at
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
> at scala.Option.foreach(Option.scala:407) at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) at
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2202) at
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2223) at
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2242) at
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2267) at
> org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at
> org.apache.spark.rdd.RDD.collect(RDD.scala:1029) at
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
> at
> org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
> at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685) at
> org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
> py4j.Gateway.invoke(Gateway.java:282) at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at
> py4j.commands.CallCommand.execute(CallCommand.java:79) at
> py4j.GatewayConnection.run(GatewayConnection.java:238) at
> java.lang.Thread.run(Thread.java:748){code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]