GitHub user icexelloss opened a pull request:
https://github.com/apache/spark/pull/21082
[SPARK-22239][SQL][Python][WIP] Enable grouped aggregate pandas UDFs as
window functions
## What changes were proposed in this pull request?
This PR enables using grouped aggregate pandas UDFs as window functions.
The semantics are the same as using a SQL aggregate function as a window function.
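```
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Unbounded frame: every row sees its whole partition.
w = Window.partitionBy('id').rowsBetween(Window.unboundedPreceding,
                                         Window.unboundedFollowing)
mean_udf = pandas_udf(lambda v: v.mean(), 'double',
                      PandasUDFType.GROUPED_AGG)
result1 = df.withColumn('mean_v', mean_udf(df['v']).over(w))
```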
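To make the intended semantics concrete without a Spark session, here is a minimal plain-Python sketch of what an aggregate over an unbounded window computes: each row receives the aggregate of its whole partition. The helper `unbounded_window_agg` and the sample rows are hypothetical illustrations and are not part of this PR.

```python
from collections import defaultdict
from statistics import mean


def unbounded_window_agg(rows, key, value, agg, out):
    """Return one output row per input row, with `agg` applied to the
    values of that row's entire partition (hypothetical helper)."""
    # Collect values per partition key, conceptually what partitionBy does.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row[value])
    # Aggregate each partition once, exactly like a group-by.
    aggregated = {k: agg(vs) for k, vs in partitions.items()}
    # Attach the per-partition result to every row of that partition.
    return [dict(row, **{out: aggregated[row[key]]}) for row in rows]


# Made-up sample data with the same column names as the example above.
rows = [
    {"id": 1, "v": 1.0},
    {"id": 1, "v": 2.0},
    {"id": 2, "v": 3.0},
    {"id": 2, "v": 5.0},
    {"id": 2, "v": 10.0},
]
result = unbounded_window_agg(rows, "id", "v", mean, "mean_v")
```

The row count is unchanged, unlike a plain group-by, but the aggregate itself is evaluated only once per partition.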
The scope of this PR is limited in two ways:
(1) It only supports unbounded windows, which essentially act as a group by.
(2) It only supports aggregation functions, not "transform"-like window
functions (n -> n mapping).
Both of these are left as future work. (1) in particular needs careful
thought about how to pass rolling window data to Python efficiently. (2) is
somewhat easier but requires more changes, so I think it is better left to
a separate PR.
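The difference between the supported case and the two deferred ones can be sketched in plain Python. This is only an illustration of the shapes involved, assuming a single, already ordered partition. The `rolling_agg` helper and the sample values are hypothetical and say nothing about how Spark would implement bounded windows.

```python
from statistics import mean

values = [1.0, 2.0, 3.0, 4.0]  # one partition, already ordered

# Grouped aggregate (n -> 1): the case supported over an unbounded window.
agg_result = mean(values)

# "Transform"-like function (n -> n): one output per input row,
# for example subtracting the partition mean from each value.
transform_result = [v - agg_result for v in values]


def rolling_agg(vals, preceding, following, agg):
    """Apply `agg` to a bounded frame around each row (hypothetical helper)."""
    out = []
    for i in range(len(vals)):
        lo = max(0, i - preceding)
        hi = min(len(vals), i + following + 1)
        # Each row sees a different slice, so `agg` runs once per row.
        out.append(agg(vals[lo:hi]))
    return out


# Frame of one preceding row plus the current row.
rolling_result = rolling_agg(values, 1, 0, mean)
```

With an unbounded frame the aggregate runs once per partition. With a bounded frame it runs once per row on a different slice, which is why passing that data to Python efficiently needs more thought.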
**This PR is currently WIP**
## How was this patch tested?
WindowPandasUDFTests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/icexelloss/spark SPARK-22239-window-udf
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21082.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21082
----
commit 54609cf97aa7e6b3d55f288e1b2aa92ac91e9b28
Author: Li Jin <ice.xelloss@...>
Date: 2018-03-24T19:52:55Z
wip
commit f454933ac45bbf4a8bf0b87192bf4b323724b3fe
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-04T14:41:03Z
wip
commit e5207455d835198fcab253a99b490784ee04b3cf
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-05T14:06:53Z
wip
commit 78dc82b881fa15b4e6ef0380177418df58df392d
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-06T18:34:48Z
wip
commit 083ae4a5b9676eb953b181f98ba0c5c1fb3fce47
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-16T21:23:37Z
Test passes
commit 15bbedf92b27afc0234669a0fcd21d141511aa17
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-16T21:54:15Z
Clean up
commit 6a964d433b6e318af515bfc3ee38c8e3621872d7
Author: Li Jin <ice.xelloss@...>
Date: 2018-04-16T21:56:24Z
white space
----