Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20211#discussion_r161371619
--- Diff: python/pyspark/sql/group.py ---
@@ -233,6 +233,27 @@ def apply(self, udf):
| 2| 1.1094003924504583|
+---+-------------------+
+ Notes on grouping column:
--- End diff --
To my knowledge, the current implementation follows Pandas's default
groupby-apply behaviour for a Pandas DataFrame -> Pandas DataFrame function
(correct me if I am wrong). So, I was thinking that we shouldn't start by
prepending the grouping columns; instead, we could consider a `gapply`-like
approach.
I think it's still feasible to support both ideas - if the given function
takes a single argument, we pass the pdf as input; if it takes two arguments,
we pass the key and the pdf. That way the `gapply`-like behaviour can be
supported optionally.
It's a rough idea, but I think we can do this in theory since we can
`inspect` the function's signature ahead of the computation.
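A minimal sketch of the dispatch idea (not the actual PySpark implementation;
the wrapper name and error message are just illustrative):

```python
import inspect


def wrap_grouped_map_udf(func):
    # Count the positional parameters the user's function declares.
    num_args = len(inspect.getfullargspec(func).args)
    if num_args == 1:
        # Current behaviour: pass only the group's Pandas DataFrame.
        return lambda key, pdf: func(pdf)
    elif num_args == 2:
        # `gapply`-like behaviour: also pass the grouping key.
        return lambda key, pdf: func(key, pdf)
    else:
        raise ValueError(
            "Grouped map function must take 1 (pdf) or 2 (key, pdf) arguments")
```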
WDYT guys?
---