Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20211#discussion_r161371619

    --- Diff: python/pyspark/sql/group.py ---
    @@ -233,6 +233,27 @@ def apply(self, udf):
            | 2| 1.1094003924504583|
            +---+-------------------+
    +
    +        Notes on grouping column:
    --- End diff --

To my knowledge, the current implementation follows pandas's default groupby-apply behavior for pandas DataFrame -> pandas DataFrame functions (correct me if I am wrong). So I was thinking that we shouldn't start by prepending the grouping columns; instead, we could consider a `gapply`-like alternative.

I think it's still feasible to support both ideas: if the given function takes a single argument, we pass the pdf as input; if it takes two arguments, we pass the key and the pdf. That way, we can support the `gapply`-like behavior optionally. It's a rough idea, but I think we can do this in theory, since we can `inspect` the function's signature ahead of computation.

WDYT guys?
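The arity-based dispatch described above could be sketched roughly as follows. This is a hypothetical helper (`call_grouped_udf` is an illustrative name, not Spark's actual implementation), showing how `inspect.signature` could distinguish a pandas-style one-argument function from a `gapply`-style two-argument one:

```python
import inspect
import pandas as pd

def call_grouped_udf(func, key, pdf):
    """Hypothetical dispatch helper: inspect the user function's arity
    and pass either (pdf) or (key, pdf), as sketched in the comment."""
    num_params = len(inspect.signature(func).parameters)
    if num_params == 1:
        # pandas-style apply: the function only sees the group's data
        return func(pdf)
    elif num_params == 2:
        # gapply-style: the function also receives the grouping key
        return func(key, pdf)
    raise ValueError("expected a function of 1 or 2 arguments")

# One group's data: key ('a',) and its rows
pdf = pd.DataFrame({"id": ["a", "a"], "v": [1.0, 2.0]})

def subtract_mean(pdf):
    # one argument: behaves like pandas groupby().apply()
    return pdf.assign(v=pdf.v - pdf.v.mean())

def tag_key(key, pdf):
    # two arguments: also sees the grouping key, gapply-style
    return pdf.assign(key=key[0])

out1 = call_grouped_udf(subtract_mean, ("a",), pdf)
out2 = call_grouped_udf(tag_key, ("a",), pdf)
```

Both user functions are served by the same entry point; the arity check happens once, before any computation on the groups.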