Philip Kahn created SPARK-30239:
-----------------------------------
Summary: [Python] Creating a dataframe with Pandas rather than
Numpy datatypes fails
Key: SPARK-30239
URL: https://issues.apache.org/jira/browse/SPARK-30239
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.3
Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 |
Scala 2.11
Reporter: Philip Kahn
It's possible to work with DataFrames in Pandas and convert them back to
Spark DataFrames for processing; however, using Pandas extension datatypes like
{{Int64}} ([https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html])
throws an error (that long / float can't be converted).
Internally, this happens because {{np.nan}} is a float, and {{pd.Int64Dtype()}}
allows only integers except for the single float value {{np.nan}}.
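The mismatch described above can be seen on the Pandas side alone. The following is a minimal sketch, not taken from the report; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A nullable-integer column: integer values plus one missing entry.
nullable = pd.Series([1, 2, np.nan], dtype="Int64")

# The same values without an explicit dtype fall back to float64,
# because np.nan itself is a Python float.
plain = pd.Series([1, 2, np.nan])

print(type(np.nan))    # <class 'float'>
print(nullable.dtype)  # Int64
print(plain.dtype)     # float64
```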
The current workaround is to keep the affected columns as floats and, after
conversion to the Spark DataFrame, recast each column as {{LongType()}}. For
example:
{{from pyspark.sql.types import LongType}}
{{sdfC = spark.createDataFrame(kgridCLinked)}}
{{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}
However, this is awkward and redundant.
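Until this is fixed, the workaround could be wrapped in a helper so it is not repeated per column. The following is only a sketch: the function names {{nullable_ints_to_float}} and {{to_spark_with_longs}} are hypothetical, and how missing values come out of the final cast is not covered in this report, so it should be verified against real data.

```python
import pandas as pd


def nullable_ints_to_float(pdf):
    """Return a copy of pdf with nullable-integer columns cast to float64,
    together with the names of the columns that were converted."""
    out = pdf.copy()
    converted = []
    for name in out.columns:
        dtype = out[name].dtype
        is_extension = isinstance(dtype, pd.api.extensions.ExtensionDtype)
        if is_extension and pd.api.types.is_integer_dtype(dtype):
            # Missing entries become NaN in the float column.
            out[name] = out[name].astype("float64")
            converted.append(name)
    return out, converted


def to_spark_with_longs(spark, pdf):
    """Hypothetical wrapper around the workaround from this report."""
    # Imported lazily so the Pandas helper above works without Spark installed.
    from pyspark.sql.types import LongType

    floats, converted = nullable_ints_to_float(pdf)
    sdf = spark.createDataFrame(floats)
    for name in converted:
        # Recast each converted column, as in the manual workaround.
        # Assumption: check how NaN values are treated by this cast.
        sdf = sdf.withColumn(name, sdf[name].cast(LongType()))
    return sdf
```

With an existing session this would be called as {{sdfC = to_spark_with_longs(spark, kgridCLinked)}}.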
--
This message was sent by Atlassian Jira
(v8.3.4#803005)