[
https://issues.apache.org/jira/browse/SPARK-27778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Vogelbacher updated SPARK-27778:
--------------------------------------
Summary: toPandas with arrow enabled fails for DF with no partitions (was:
toPandas with arrow enabled fails for DF with no partition)
> toPandas with arrow enabled fails for DF with no partitions
> -----------------------------------------------------------
>
> Key: SPARK-27778
> URL: https://issues.apache.org/jira/browse/SPARK-27778
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: David Vogelbacher
> Priority: Major
>
> Calling {{toPandas}} with {{spark.sql.execution.arrow.enabled: true}} fails for
> dataframes with no partitions. The error is an {{EOFError}}. With
> {{spark.sql.execution.arrow.enabled: false}} the conversion succeeds.
> Repro (on current master branch):
> {noformat}
> >>> from pyspark.sql.types import *
> >>> schema = StructType([StructField("field1", StringType(), True)])
> >>> df = spark.createDataFrame(sc.emptyRDD(), schema)
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> >>> df.toPandas()
> /Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py:2162:
> UserWarning: toPandas attempted Arrow optimization because
> 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error
> below and can not continue. Note that
> 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on
> failures in the middle of computation.
> warnings.warn(msg)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line
> 2143, in toPandas
> batches = self._collectAsArrow()
> File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
> File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line
> 210, in load_stream
> num = read_int(stream)
> File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line
> 810, in read_int
> raise EOFError
> EOFError
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> >>> df.toPandas()
> Empty DataFrame
> Columns: [field1]
> Index: []
> {noformat}
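The mechanism behind the traceback can be sketched in plain Python. The collect protocol is length-prefixed: every read starts with a 4-byte big-endian int, and {{read_int}} in {{pyspark/serializers.py}} raises {{EOFError}} when the stream yields no bytes. With zero partitions the JVM side writes no Arrow batches before the socket closes, so the first {{read_int}} hits end-of-stream. The stand-in below is a simplified sketch for illustration, not the actual Spark code:

```python
import io
import struct

def read_int(stream):
    """Simplified stand-in for read_int in pyspark/serializers.py:
    reads a 4-byte big-endian int; an empty read means the peer
    closed the stream, which is surfaced as EOFError."""
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

# A stream carrying one framed int is read back fine.
ok = io.BytesIO(struct.pack("!i", 42))
print(read_int(ok))  # 42

# A stream that ends immediately (no batches written before the
# socket closed) reproduces the EOFError from the traceback above.
try:
    read_int(io.BytesIO(b""))
except EOFError:
    print("EOFError, as in the report")
```

This suggests the fix belongs on the JVM side: the writer should emit the end-of-stream marker even when there are no partitions to serialize.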
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]