GitHub user brkyvz opened a pull request:
https://github.com/apache/spark/pull/12008
[SPARK-14160] Time Windowing functions for DataSets
## What changes were proposed in this pull request?
This PR adds the function `window` as a column expression.
`window` can be used to bucket rows into time windows given a time column.
With this expression, performing time series analysis on batch data, as well as
streaming data should become much more simpler.
### Usage
Assume the following schema:
`sensor_id, measurement, timestamp`
To average 5 minute data every 1 minute (window length of 5 minutes, slide
duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ),
"sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00 ...
```
Intervals will start at every `slideDuration` starting at the unix epoch
(1970-01-01 00:00:00 UTC).
To start intervals at a different point of time, e.g. 30 seconds after a
minute, the `startTime` parameter can be used.
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ, "30
second"), "sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30 ...
```
Support for Python will be made in a follow up PR after this.
## How was this patch tested?
This patch has some basic unit tests for the `TimeWindow` expression
testing that the parameters pass validation, and it also has some
unit/integration tests testing the correctness of the windowing and usability
in complex operations (multi-column grouping, multi-column projections, joins).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/brkyvz/spark df-time-window
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12008.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12008
----
commit 168568f9e231827dd1a00eaf2f470592a5aa7ded
Author: Burak Yavuz <[email protected]>
Date: 2016-03-26T21:02:30Z
time windowing
commit df7bce039b02de75fbc4dcff521dc21ee3f86d18
Author: Burak Yavuz <[email protected]>
Date: 2016-03-26T23:31:47Z
finished adding time windowing
commit 03c14563786496276ecadb2951f818add4190396
Author: Burak Yavuz <[email protected]>
Date: 2016-03-27T19:56:33Z
works
commit a41ee4c38f88990278413d7c0e61be517f7cd5cd
Author: Burak Yavuz <[email protected]>
Date: 2016-03-27T21:10:47Z
minor clean up
commit db5046545a8761f289d516518c3e1ff91e551602
Author: Burak Yavuz <[email protected]>
Date: 2016-03-28T06:25:38Z
tests
commit 9e7febb7dcaf9c0cdd312fe6978f9137f1dbaa64
Author: Burak Yavuz <[email protected]>
Date: 2016-03-28T18:52:19Z
finished
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]