Pavel Ganelin created SPARK-34521:
-------------------------------------
Summary: spark.createDataFrame does not support Pandas StringDtype
extension type
Key: SPARK-34521
URL: https://issues.apache.org/jira/browse/SPARK-34521
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1
Reporter: Pavel Ganelin
The following test case demonstrates the problem:
{code:java}
import pandas as pd
from pyspark.sql import SparkSession, types
spark = SparkSession.builder.appName(__file__)\
.config("spark.sql.execution.arrow.pyspark.enabled","true") \
.getOrCreate()
good = pd.DataFrame([["abc"]], columns=["col"])
schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)
df.show()
bad = good.copy()
bad["col"]=bad["col"].astype("string")
schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(bad, schema=schema)
df.show(){code}
The error:
{code:java}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289:
UserWarning: createDataFrame attempted Arrow optimization because
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by
the reason below:
Cannot specify a mask or a size when passing an object that is converted with
the __arrow_array__ protocol.
Attempting non-optimization as
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]