[ https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul George updated SPARK-48045:
--------------------------------
Summary: Pandas API groupby with multi-agg-relabel ignores as_index=False
(was: Pandas groupby with multi-agg-relabel ignores as_index=False)
> Pandas API groupby with multi-agg-relabel ignores as_index=False
> ----------------------------------------------------------------
>
> Key: SPARK-48045
> URL: https://issues.apache.org/jira/browse/SPARK-48045
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 3.5.1
> Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
> Reporter: Paul George
> Priority: Minor
>
> A Pandas API on Spark DataFrame groupby with as_index=False and an
> aggregation relabeling (named aggregation), such as
> {code:python}
> from pyspark import pandas as ps
>
> psdf = ps.DataFrame({"a": [0, 0], "b": [0, 1]})
> psdf.groupby("a", as_index=False).agg(b_max=("b", "max")){code}
> omits the group keys from the resulting DataFrame, which diverges from the
> expected behavior (and from the behavior of native pandas):
> *actual*
> {code:java}
>    b_max
> 0      1
> {code}
> *expected*
> {code:java}
>    a  b_max
> 0  0      1
> {code}
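
For reference, the native pandas behavior that the report compares against can be checked with a short script. This is only a minimal sketch, assuming pandas is installed (the report used 2.2.2); the variable names pdf and result are illustrative and do not come from the issue.

```python
import pandas as pd

# Same data and named aggregation as in the report, but with native pandas.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False, the group key "a" is kept as a regular column.
print(list(result.columns))
print(result)
```

Running the same call through pyspark.pandas on an affected version should, per the report, return only the b_max column.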
>
> A possible fix is to prepend groupby key columns to {{*order*}} and
> {{*columns*}} before filtering here:
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
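
The idea behind the proposed fix can be sketched outside Spark as a small helper. This is not the code in groupby.py: prepend_group_keys and its parameters are hypothetical, and plain string labels stand in for whatever label types the real implementation uses. It only illustrates putting the group keys in front of both lists when as_index=False, so that a later column selection and rename would keep them.

```python
from typing import List, Sequence, Tuple


def prepend_group_keys(
    group_keys: Sequence[str],
    order: Sequence[str],
    columns: Sequence[str],
    as_index: bool,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (order, columns) with the group keys first.

    `order` holds the aggregated column labels to select, and `columns`
    holds the relabeled output names, position by position.
    """
    if as_index:
        # Group keys live in the index, so nothing needs to be added.
        return list(order), list(columns)
    # Add only the keys that are not already selected, keeping their order.
    missing = [key for key in group_keys if key not in order]
    return missing + list(order), missing + list(columns)


new_order, new_columns = prepend_group_keys(
    group_keys=["a"], order=["b"], columns=["b_max"], as_index=False
)
print(new_order, new_columns)  # ['a', 'b'] ['a', 'b_max']
```

Whether the actual patch should deduplicate keys this way, or handle multi-level labels differently, would need to be decided against the real code at the linked lines.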
>
>