[
https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul George updated SPARK-48045:
--------------------------------
Description:
A Pandas API DataFrame groupby with as_index=False and a multilevel relabeling,
such as
{code:python}
from pyspark import pandas as ps

ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
{code}
fails to include group keys in the resulting DataFrame. This diverges from
expected behavior as well as from the behavior of native Pandas, e.g.
*actual*
{code:java}
   b_max
0      1
{code}
*expected*
{code:java}
   a  b_max
0  0      1
{code}
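For comparison, the native Pandas behavior referred to above can be checked with a minimal sketch like the one below. It assumes pandas is installed (the reported environment lists Pandas 2.2.2); the variable names are illustrative only.

```python
import pandas as pd

# Same data and aggregation as in the report, but using native pandas.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False, the group key "a" is kept as a regular column.
result_columns = list(result.columns)
print(result_columns)
```

Running the same call through pyspark.pandas on an affected version should show only the b_max column, as in the *actual* output.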
A possible fix is to prepend groupby key columns to {{*order*}} and
{{*columns*}} before filtering here:
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
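The idea behind the proposed fix can be sketched in plain Python. This is not the code in groupby.py: the function name apply_relabeling, its parameters, and the dict-based stand-in for the aggregated frame are all hypothetical, chosen only to illustrate prepending the group keys to order and columns before the filtering step.

```python
def apply_relabeling(table, order, columns, key_labels, as_index):
    """Illustrative stand-in for the relabeling step (hypothetical helper).

    table: dict mapping a column label to its list of values, standing in
           for the aggregated frame before relabeling.
    order: labels of the aggregated columns to keep, in output order.
    columns: the new names for those columns.
    key_labels: labels of the groupby key columns.
    """
    if not as_index:
        # Prepend the group keys so the filtering step below keeps them.
        order = list(key_labels) + list(order)
        columns = list(key_labels) + list(columns)
    # Filtering and renaming: keep only the columns in `order`, renamed.
    return {new: table[old] for old, new in zip(order, columns)}


# Aggregated data for the example in this report: key "a", max of "b".
aggregated = {"a": [0], ("b", "max"): [1]}

fixed = apply_relabeling(
    aggregated,
    order=[("b", "max")],
    columns=["b_max"],
    key_labels=["a"],
    as_index=False,
)
indexed = apply_relabeling(
    aggregated,
    order=[("b", "max")],
    columns=["b_max"],
    key_labels=["a"],
    as_index=True,
)
```

In this sketch the keys are only prepended when as_index=False, so the as_index=True path is left unchanged.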
was:
A Pandas API DataFrame groupby with as_index=False and a multilevel relabeling,
such as
{code:python}
from pyspark import pandas as ps

ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
{code}
fails to include the group keys in the resulting DataFrame which diverges from
the expected behavior (as well as the behavior of native Pandas), e.g.
*actual*
{code:java}
   b_max
0      1
{code}
*expected*
{code:java}
   a  b_max
0  0      1
{code}
A possible fix is to prepend groupby key columns to {{*order*}} and
{{*columns*}} before filtering here:
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
> Pandas API groupby with multi-agg-relabel ignores as_index=False
> ----------------------------------------------------------------
>
> Key: SPARK-48045
> URL: https://issues.apache.org/jira/browse/SPARK-48045
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 3.5.1
> Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
> Reporter: Paul George
> Priority: Minor