Raman Srinivasan created SPARK-34348:
----------------------------------------
Summary: applyInPandas doesn't seem to work with StructType output schema
Key: SPARK-34348
URL: https://issues.apache.org/jira/browse/SPARK-34348
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.1
Reporter: Raman Srinivasan
{code:java}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    pdf['count'] = pdf.shape[0]
    return pdf
{code}
Using a DDL-formatted string for output schema works fine:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count int").show()
+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}
But using a StructType schema (appending an integer count column) fails:
{code:java}
from pyspark.sql.types import IntegerType, StructField
df.groupby("id").applyInPandas(subtract_mean,
    schema=df.schema.add(StructField('count', IntegerType(), False))).show()
AnalysisException: Cannot resolve column name "count" among (id, v);
{code}
It appears to be looking for the new return field in the input schema?