[
https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-48045:
-----------------------------------
Labels: pull-request-available (was: )
> Pandas API groupby with multi-agg-relabel ignores as_index=False
> ----------------------------------------------------------------
>
> Key: SPARK-48045
> URL: https://issues.apache.org/jira/browse/SPARK-48045
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 3.5.1
> Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
> Reporter: Paul George
> Priority: Minor
> Labels: pull-request-available
>
> A Pandas API DataFrame groupby with as_index=False and a multilevel
> relabeling, such as
> {code:python}
> from pyspark import pandas as ps
> ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby(
>     "a", as_index=False
> ).agg(b_max=("b", "max"))
> {code}
> fails to include the group keys in the resulting DataFrame. This diverges
> from the behavior of native pandas, which keeps the key column when
> as_index=False:
> *actual*
> {noformat}
>    b_max
> 0      1
> {noformat}
> *expected*
> {noformat}
>    a  b_max
> 0  0      1
> {noformat}
>
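For reference, the expected result can be checked against native pandas. This is a minimal sketch that uses plain pandas only (no Spark session), so it shows the behavior the report compares against rather than the pandas-on-Spark result itself:

```python
import pandas as pd

# Same data and call as in the report, but on a native pandas DataFrame.
df = pd.DataFrame({"a": [0, 0], "b": [0, 1]})

# Named aggregation with as_index=False: pandas keeps the group key "a"
# as a regular column next to the relabeled aggregate "b_max".
result = df.groupby("a", as_index=False).agg(b_max=("b", "max"))

print(list(result.columns))
```

The column list contains both the key `a` and the relabeled `b_max`, matching the *expected* output in the report.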
> A possible fix is to prepend groupby key columns to {{*order*}} and
> {{*columns*}} before filtering here:
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
>
>
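The proposed fix can be illustrated with a small stand-in written against native pandas. This is only a sketch of the idea: the function `relabel` and its `prepend_keys` flag are hypothetical, and the code assumes (without having checked the linked lines) that the relabeling step selects the columns named in `order` and then renames them to `columns`. It is not the actual `pyspark.pandas` implementation.

```python
import pandas as pd


def relabel(agg_df, keys, order, columns, prepend_keys):
    """Hypothetical stand-in for a relabeling step (not the pyspark code).

    Selects the aggregated columns listed in `order` and renames them to
    `columns`. With prepend_keys=True, the group keys are added to the
    front of both lists first, as the report suggests for as_index=False.
    """
    if prepend_keys:
        order = list(keys) + list(order)
        columns = list(keys) + list(columns)
    out = agg_df[order].copy()
    out.columns = columns
    return out


df = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
# Un-relabeled aggregate with the key kept as a column: columns are a, b.
agg_df = df.groupby("a", as_index=False).agg({"b": "max"})

# Filtering by the relabel lists alone drops the key column.
without_keys = relabel(agg_df, ["a"], ["b"], ["b_max"], prepend_keys=False)
# Prepending the keys before filtering keeps it.
with_keys = relabel(agg_df, ["a"], ["b"], ["b_max"], prepend_keys=True)

print(list(without_keys.columns))
print(list(with_keys.columns))
```

Under these assumptions, the first call reproduces the *actual* shape from the report (only `b_max`) and the second reproduces the *expected* shape (`a` and `b_max`).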