sinaiamonkar-sai opened a new pull request, #46391:
URL: https://github.com/apache/spark/pull/46391
### What changes were proposed in this pull request?
When `groupby(...).agg(...)` is used in the pandas API on Spark (`pyspark.pandas`) with relabeling of the aggregate columns (named aggregation) and `as_index=False`, the group-by columns are not returned in the resulting DataFrame. This change fixes that bug.
Example:

    ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))

Result:

       b_max
    0      1

Expected result:

       a  b_max
    0  0      1
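For reference, plain pandas keeps the group key as a column for the same call, and the pandas API on Spark is meant to mirror that behavior. A minimal comparison snippet, assuming only pandas is installed:

```python
import pandas as pd

# Same operation in plain pandas: because as_index=False was requested,
# the group key "a" is kept as a regular column next to the relabeled
# aggregate column.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
print(pdf.groupby("a", as_index=False).agg(b_max=("b", "max")))
#    a  b_max
# 0  0      1
```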
### Why are the changes needed?
The relabeling code path builds the result from the aggregate columns only. With `as_index=True` this is not a problem, because the group-by columns end up in the index. With `as_index=False`, the group-by columns must be added back alongside the relabeled aggregate columns. Please see the commits in this PR for the actual code changes; an illustrative sketch of the idea follows below.
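The snippet below is only a rough sketch of the idea in plain pandas terms, not the PR's actual implementation (which changes the relabeling logic in pandas-on-Spark's groupby code); the helper name `agg_with_relabel` is hypothetical:

```python
import pandas as pd

def agg_with_relabel(pdf, by, as_index, **named_aggs):
    # Hypothetical helper illustrating the intended behavior.
    # Build the relabeled aggregate columns first, as the existing code does.
    grouped = pdf.groupby(by)
    agg_cols = {
        new_name: grouped[col].agg(func)
        for new_name, (col, func) in named_aggs.items()
    }
    result = pd.DataFrame(agg_cols)
    if as_index:
        # The group-by keys already live in the index, so nothing is lost.
        return result
    # With as_index=False, restore the group-by key column(s) in front of
    # the relabeled aggregates instead of dropping them (the bug being fixed).
    return result.reset_index()

pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
print(agg_with_relabel(pdf, "a", as_index=False, b_max=("b", "max")))
#    a  b_max
# 0  0      1
```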
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Passed GA
- Passed Build tests
- Added unit tests covering the scenario from the JIRA ticket as well as additional scenarios
### Was this patch authored or co-authored using generative AI tooling?
No