[
https://issues.apache.org/jira/browse/SPARK-34348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278404#comment-17278404
]
Raman Srinivasan commented on SPARK-34348:
------------------------------------------
My mistake: I modified the schema of the original DataFrame in place.
> applyInPandas doesn't seem to work with StructType output schema
> -----------------------------------------------------------------
>
> Key: SPARK-34348
> URL: https://issues.apache.org/jira/browse/SPARK-34348
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1
> Reporter: Raman Srinivasan
> Priority: Major
>
>
> {code:python}
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> def subtract_mean(pdf):
>     # pdf is a pandas.DataFrame
>     pdf['count'] = pdf.shape[0]
>     return pdf
> {code}
>
>
> Using a DDL-formatted string for output schema works fine:
> {code:python}
> df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count int").show()
> +---+----+-----+
> | id|   v|count|
> +---+----+-----+
> |  1| 1.0|    2|
> |  1| 2.0|    2|
> |  2| 3.0|    3|
> |  2| 5.0|    3|
> |  2|10.0|    3|
> +---+----+-----+
> {code}
>
>
> But using a StructType schema (appending an integer count column) fails:
> {code:python}
> from pyspark.sql.types import IntegerType, StructField
>
> df.groupby("id").applyInPandas(subtract_mean,
>     schema=df.schema.add(StructField('count', IntegerType(), False))).show()
> AnalysisException: Cannot resolve column name "count" among (id, v);
> {code}
> It appears to be looking for the new return field in the input schema?
> As a workaround, is there a toDDL method I can use to get the current schema
> as a DDL string to which I can append the new return fields?
>
>