Paul George created SPARK-48045:
-----------------------------------

             Summary: Pandas groupby with multi-agg-relabel ignores as_index=False
                 Key: SPARK-48045
                 URL: https://issues.apache.org/jira/browse/SPARK-48045
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.5.1
         Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
            Reporter: Paul George


A pandas API on Spark DataFrame groupby with as_index=False and a 
multi-aggregation relabeling, such as

{code:python}
from pyspark import pandas as ps

ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
{code}

fails to include the group keys in the resulting DataFrame. This diverges from 
the expected behavior, which is also the behavior of native pandas:

*actual*

{code:java}
   b_max
0      1
{code}

*expected*

{code:java}
   a  b_max
0  0      1
{code}
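
For comparison, the native pandas behavior referred to above can be checked with a minimal sketch like the one below. It assumes only that pandas is installed (2.2.2 in the reported environment) and mirrors the pandas-on-Spark call one-to-one; the variable names `pdf` and `result` are illustrative.

```python
import pandas as pd

# Same data and the same groupby/agg call as in the report, on native pandas.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False the group key "a" is kept as a regular column.
print(list(result.columns))
print(result)
```

Here the resulting columns are `a` and `b_max`, which matches the *expected* output.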
 

A possible fix is to prepend the groupby key columns to `order` and `columns` 
when as_index=False, before the filtering step here:  
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
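
The idea can be illustrated with a rough sketch in plain pandas. This is not the Spark source: the helper `relabel`, its `key_columns` parameter, and the flat intermediate frame `agg_df` are hypothetical stand-ins, and only the names `order` and `columns` come from the description above. It assumes the relabeling step selects the aggregated columns listed in `order` and then renames them to `columns`, which is how the group key gets filtered out.

```python
import pandas as pd


def relabel(agg_df, order, columns, key_columns=()):
    """Hypothetical stand-in for the relabeling step.

    Selects the columns named in `order` and renames them to `columns`.
    When `key_columns` is given, the keys are prepended to both lists
    first, so they survive the selection.
    """
    order = list(key_columns) + list(order)
    columns = list(key_columns) + list(columns)
    out = agg_df[order].copy()
    out.columns = columns
    return out


# Assumed intermediate result: group key "a" plus the aggregated column "b".
agg_df = pd.DataFrame({"a": [0], "b": [1]})

# Selecting only the aggregated column drops the key (the reported behavior).
without_keys = relabel(agg_df, order=["b"], columns=["b_max"])

# Prepending the key keeps it (the proposed behavior for as_index=False).
with_keys = relabel(agg_df, order=["b"], columns=["b_max"], key_columns=["a"])

print(list(without_keys.columns))
print(list(with_keys.columns))
```

In this sketch the first call yields only `b_max`, while the second yields `a` and `b_max`. Whether the real change belongs at exactly the linked lines would need to be confirmed against the current source.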
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
