Paul George created SPARK-48045:
-----------------------------------
Summary: Pandas groupby with multi-agg-relabel ignores as_index=False
Key: SPARK-48045
URL: https://issues.apache.org/jira/browse/SPARK-48045
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark
Affects Versions: 3.5.1
Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
Reporter: Paul George
A pandas API on Spark DataFrame groupby with as_index=False and a multi-aggregation
relabeling, such as
{code:java}
from pyspark import pandas as ps

ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
{code}
fails to include the group keys in the resulting DataFrame, which diverges from
the expected behavior (and from the behavior of native pandas), e.g.
*actual*
{code:java}
   b_max
0      1
{code}
*expected*
{code:java}
   a  b_max
0  0      1
{code}
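For comparison, the same call against native pandas can be run as a quick check of the expected result. This is only a minimal sketch: the names `pdf` and `result` are illustrative and not part of the report.

```python
import pandas as pd

# Same data and call as in the report, but using native pandas.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False the group key "a" is kept as a regular column.
print(list(result.columns))  # ['a', 'b_max']
```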
A possible fix is to prepend groupby key index columns to `order` and `columns`
before filtering here:
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
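The idea behind the proposed fix can be illustrated in plain Python. This is a hypothetical sketch, not the actual Spark code: the `relabel` helper, the dict standing in for the aggregated frame, and the `group_keys` parameter are all assumptions made for illustration. Only the names `order` and `columns` come from the description above.

```python
# Hypothetical stand-in for the aggregated frame: column label -> values.
aggregated = {"a": [0], ("b", "max"): [1]}


def relabel(frame, order, columns, group_keys=(), as_index=True):
    """Select `order` from `frame` and rename the selection to `columns`.

    When as_index is False, the group keys are prepended to both lists
    first, so they survive the selection step.
    """
    order = list(order)
    columns = list(columns)
    if not as_index:
        order = list(group_keys) + order
        columns = list(group_keys) + columns
    return {new: frame[old] for old, new in zip(order, columns)}


# Without prepending, the group key "a" is filtered out.
dropped = relabel(aggregated, [("b", "max")], ["b_max"])
# With the keys prepended, "a" is kept ahead of the relabeled column.
kept = relabel(aggregated, [("b", "max")], ["b_max"], group_keys=["a"], as_index=False)

print(dropped)  # {'b_max': [1]}
print(kept)     # {'a': [0], 'b_max': [1]}
```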