[ https://issues.apache.org/jira/browse/SPARK-35401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345609#comment-17345609 ]

Hyukjin Kwon commented on SPARK-35401:
--------------------------------------

[~nafarmer] mind elaborating on your environment?
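
The stack trace shows TaskResultGetter failing while the driver deserializes a task result with Java serialization; a java.io.ObjectInputStream expects every stream to begin with the magic bytes AC ED 00 05, so a header of 204356EC means the driver received bytes that no matching serializer produced, which usually points at mismatched Spark, Java, or Python components. Something like the following, run on the affected machine, should capture the relevant details (a minimal sketch; sc._jvm is an internal Py4J handle, used here only for diagnostics):

{code:python}
import platform

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('envcheck').getOrCreate()
sc = spark.sparkContext

# Python-side versions
print('Python:            ', platform.python_version())
print('pyspark module:    ', pyspark.__version__)
print('SPARK_HOME:        ', findspark.find())

# JVM-side versions, read over the Py4J gateway (sc._jvm is internal API)
print('JVM Spark version: ', sc.version)
print('Java version:      ', sc._jvm.java.lang.System.getProperty('java.version'))
print('OS:                ', sc._jvm.java.lang.System.getProperty('os.name'))

spark.stop()
{code}

If the pyspark module version and the JVM Spark version printed here disagree, that mismatch alone can explain the corrupted stream.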

> java.io.StreamCorruptedException: invalid stream header: 204356EC when using 
> toPandas() with PySpark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35401
>                 URL: https://issues.apache.org/jira/browse/SPARK-35401
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 3.1.1
>            Reporter: Nathan Farmer
>            Priority: Minor
>
> Whenever I try to read a Spark dataset using PySpark and convert it to a 
> Pandas df for modeling, I get the error `java.io.StreamCorruptedException: 
> invalid stream header: 204356EC` on the toPandas() step.
> I am not a Java coder (hence PySpark), so these errors can be pretty 
> cryptic to me. I tried the following things, but I still have this issue:
>  * Made sure my Spark and PySpark versions matched, as suggested here: 
> [java.io.StreamCorruptedException when importing a CSV to a Spark 
> DataFrame|https://stackoverflow.com/questions/53286071/java-io-streamcorruptedexception-when-importing-a-csv-to-a-spark-dataframe] 
> (a quick check for this is sketched just after this list)
>  * Reinstalled Spark using the methods suggested here: [Complete Guide to 
> Installing PySpark on 
> MacOS|https://kevinvecmanis.io/python/pyspark/install/2019/05/31/Installing-Apache-Spark.html]
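> For reference, a quick check along those lines (a minimal sketch, assuming 
> pyspark is importable) is to compare where the module is loaded from 
> against the distribution findspark points at:
> {code:python}
> import os
> 
> import findspark
> import pyspark
> 
> # If the pip-installed pyspark and the SPARK_HOME distribution disagree,
> # the Python driver and the JVM can end up on different Spark versions.
> print('pyspark imported from:', os.path.dirname(pyspark.__file__))
> print('pyspark.__version__:  ', pyspark.__version__)
> print('findspark SPARK_HOME: ', findspark.find())
> {code}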
> The logging in the test script below verifies the Spark and PySpark versions 
> are aligned.
> test.py:
> {code:python}
> import logging
> from pyspark.sql import SparkSession
> from pyspark import SparkContext
> import findspark
>
> findspark.init()
>
> logging.basicConfig(
>     format='%(asctime)s %(levelname)-8s %(message)s',
>     level=logging.INFO,
>     datefmt='%Y-%m-%d %H:%M:%S')
>
> sc = SparkContext('local[*]', 'test')
> spark = SparkSession(sc)
> logging.info('Spark location: {}'.format(findspark.find()))
> logging.info('PySpark version: {}'.format(spark.sparkContext.version))
>
> logging.info('Reading spark input dataframe')
> test_df = spark.read.csv('./data', header=True, sep='|', inferSchema=True)
>
> logging.info('Converting spark DF to pandas DF')
> pandas_df = test_df.toPandas()
> logging.info('DF record count: {}'.format(len(pandas_df)))
> sc.stop()
> {code}
> Output:
> {code:java}
> $ python ./test.py
> 21/05/13 11:54:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 2021-05-13 11:54:34 INFO     Spark location: /Users/username/server/spark-3.1.1-bin-hadoop2.7
> 2021-05-13 11:54:34 INFO     PySpark version: 3.1.1
> 2021-05-13 11:54:34 INFO     Reading spark input dataframe
> 2021-05-13 11:54:42 INFO     Converting spark DF to pandas DF
> 21/05/13 11:54:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 21/05/13 11:54:45 ERROR TaskResultGetter: Exception while getting task result
> java.io.StreamCorruptedException: invalid stream header: 204356EC
>     at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
>     at java.io.ObjectInputStream.<init>(ObjectInputStream.java:394)
>     at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
>     at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
>     at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
>     at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:97)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
>     at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Traceback (most recent call last):
>   File "./test.py", line 23, in <module>
>     pandas_df = test_df.toPandas()
>   File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/pandas/conversion.py", line 141, in toPandas
>     pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 677, in collect
>     sock_info = self._jdf.collectToPython()
>   File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
>   File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
>     return f(*a, **kw)
>   File "/Users/username/server/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.StreamCorruptedException: invalid stream header: 204356EC
>     at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
>     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
>     at scala.Option.foreach(Option.scala:407)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>     at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
>     at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
>     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
>     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
>     at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3516)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:282)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:238)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
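
One additional check that could help localize this (a hypothetical follow-up, assuming pyarrow is installed and reusing the spark and test_df names from test.py): the failure above happens while the driver deserializes task results, before any rows reach Python, so seeing whether the Arrow-based collection path hits the same error would show whether the problem is independent of the row-based toPandas() conversion.

{code:python}
# Hypothetical follow-up check: with Arrow enabled, toPandas() transfers
# results as Arrow record batches instead of building the pandas DataFrame
# from collected Row objects on the Python side.
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
pandas_df = test_df.toPandas()
print('DF record count: {}'.format(len(pandas_df)))
{code}

If this fails with the same invalid stream header, the issue is in driver/executor serialization (and likely an environment mismatch) rather than in the pandas conversion itself.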



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
