Raman Srinivasan created SPARK-34348:
----------------------------------------

             Summary: applyInPandas doesn't seem to work with StructType output schema
                 Key: SPARK-34348
                 URL: https://issues.apache.org/jira/browse/SPARK-34348
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.1
            Reporter: Raman Srinivasan


 
{code:python}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding the rows of one group
    # add the group's row count as a new column
    pdf['count'] = pdf.shape[0]
    return pdf
{code}
 

 

Using a DDL-formatted string for the output schema works fine:

 
{code:python}
df.groupby("id").applyInPandas(
    subtract_mean, schema="id long, v double, count int").show()

+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}
 

 

But using a StructType schema (appending an integer count column) fails:
{code:python}
from pyspark.sql.types import IntegerType, StructField

df.groupby("id").applyInPandas(
    subtract_mean,
    schema=df.schema.add(StructField('count', IntegerType(), False))).show()

AnalysisException: Cannot resolve column name "count" among (id, v);
{code}
Spark appears to be looking for the new output field in the input schema?

 

 

 


