[jira] [Updated] (SPARK-48045) Pandas API groupby with multi-agg-relabel ignores as_index=False

Paul George (Jira) Tue, 30 Apr 2024 18:00:36 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul George updated SPARK-48045:
--------------------------------
    Description: 
A Pandas API on Spark DataFrame groupby with as_index=False and a 
named-aggregation relabeling, such as
{code:python}
from pyspark import pandas as ps
ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
{code}
fails to include the group keys in the resulting DataFrame. This diverges from 
the expected behavior and from the behavior of native pandas, e.g.

*actual*
{noformat}
   b_max
0      1
{noformat}
*expected*
{noformat}
   a  b_max
0  0      1
{noformat}
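
For comparison, the native pandas behavior referred to above can be checked with a short snippet. This is only a sketch that assumes a recent pandas 2.x release (the report lists Pandas 2.2.2); it does not exercise PySpark at all.

```python
import pandas as pd

# Same data and call as in the report, but on a native pandas DataFrame.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False, the group key "a" is kept as a regular column
# next to the relabeled aggregate "b_max".
print(list(result.columns))
```

Running the same call through pyspark.pandas is what produces the *actual* output shown above, with the "a" column missing.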
 

A possible fix is to prepend groupby key columns to {{*order*}} and 
{{*columns*}} before filtering here:  
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
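
The idea behind the proposed fix can be illustrated with plain pandas. This is a rough sketch, not the code in groupby.py: the names keys and agg_result are hypothetical stand-ins, and order and columns only mimic the role of the variables mentioned above (the aggregated columns to select and their new labels).

```python
import pandas as pd

# Hypothetical stand-ins for the intermediate state; not copied from groupby.py.
keys = ["a"]                                     # groupby key columns
agg_result = pd.DataFrame({"a": [0], "b": [1]})  # aggregated frame before relabeling
order = ["b"]                                    # aggregated columns to keep
columns = ["b_max"]                              # their relabeled names

# Filtering by `order` alone drops the key column, as in the reported output.
without_keys = agg_result[order].copy()
without_keys.columns = columns

# Prepending the key columns to both lists before filtering keeps them.
with_keys = agg_result[keys + order].copy()
with_keys.columns = keys + columns

print(list(without_keys.columns))
print(list(with_keys.columns))
```

Whether the actual change should apply only when as_index=False, and how it interacts with multi-level column labels, would need to be checked against the real implementation.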
 

 

  was:
A Pandas API DataFrame groupby with as_index=False and a multilevel relabeling, 
such as
{code:java}
from pyspark import pandas as ps
ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", 
as_index=False).agg(b_max=("b", "max")){code}
fails to include the group keys in the resulting DataFrame which diverges from 
the expected behavior (as well as the behavior of native Pandas), e.g.

*actual*
{code:java}
   b_max
0      1 {code}
*expected*
{code:java}
   a  b_max
0  0      1 {code}
 

A possible fix is to prepend groupby key columns to {{*order*}} and 
{{*columns*}} before filtering here:  
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
 

 


> Pandas API groupby with multi-agg-relabel ignores as_index=False
> ----------------------------------------------------------------
>
>                 Key: SPARK-48045
>                 URL: https://issues.apache.org/jira/browse/SPARK-48045
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.5.1
>         Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
>            Reporter: Paul George
>            Priority: Minor


