[GitHub] spark pull request: [SPARK-14160] Time Windowing functions for Dat...

brkyvz Mon, 28 Mar 2016 12:07:34 -0700

GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/12008


    [SPARK-14160] Time Windowing functions for DataSets

    ## What changes were proposed in this pull request?
    
    This PR adds the function `window` as a column expression.
    
    `window` can be used to bucket rows into time windows given a time column. 
With this expression, performing time series analysis on batch data, as well as 
streaming data, should become much simpler.
    
    ### Usage
    
    Assume the following schema:
    
    `sensor_id, measurement, timestamp`
    
    To average 5 minute data every 1 minute (window length of 5 minutes, slide 
duration of 1 minute), we will use:
    ```scala
    df.groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id")
      .agg(mean("measurement").as("avg_meas"))
    ```
    
    This will generate windows such as:
    ```
    09:00:00-09:05:00
    09:01:00-09:06:00
    09:02:00-09:07:00 ...
    ```
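
    The overlap between these windows can be illustrated without Spark. The 
following is a minimal sketch in plain Scala, not the code in this PR: 
`SlidingWindows`, `windowsFor` and `hms` are hypothetical names, times are 
seconds since the Unix epoch, and windows are assumed to be half-open 
(`[start, end)`). It lists every window that contains a given timestamp.
    ```scala
    // Sketch only: plain Scala, no Spark. Times are seconds since the Unix epoch.
    object SlidingWindows {
      /** All [start, end) windows of length `windowSec`, starting every
        * `slideSec` seconds from the epoch, that contain the timestamp `ts`. */
      def windowsFor(ts: Long, windowSec: Long, slideSec: Long): Seq[(Long, Long)] = {
        // Latest window start that is <= ts (floorMod also handles negative ts).
        val lastStart = ts - Math.floorMod(ts, slideSec)
        // Step back one slide at a time while the window still covers ts.
        Iterator.iterate(lastStart)(_ - slideSec)
          .takeWhile(start => start + windowSec > ts)
          .map(start => (start, start + windowSec))
          .toSeq
          .reverse
      }

      // Format the time-of-day part of an epoch second as HH:mm:ss.
      def hms(sec: Long): String = {
        val s = Math.floorMod(sec, 86400L)
        f"${s / 3600}%02d:${(s % 3600) / 60}%02d:${s % 60}%02d"
      }

      def main(args: Array[String]): Unit = {
        val ts = 9 * 3600L + 4 * 60 + 20 // 09:04:20 on 1970-01-01 UTC
        windowsFor(ts, windowSec = 300, slideSec = 60)
          .foreach { case (s, e) => println(s"${hms(s)}-${hms(e)}") }
      }
    }
    ```
    With a 5 minute window and a 1 minute slide, a row at 09:04:20 falls into 
five windows, from 09:00:00-09:05:00 through 09:04:00-09:09:00, so it 
contributes to five separate averages.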
    
    Windows start every `slideDuration`, aligned to the Unix epoch 
(1970-01-01 00:00:00 UTC).
    To start windows at a different offset, e.g. 30 seconds past each 
minute, use the `startTime` parameter.
    
    ```scala
    df.groupBy(window("timestamp", "5 minutes", "1 minute", "30 second"), "sensor_id")
      .agg(mean("measurement").as("avg_meas"))
    ```
    
    This will generate windows such as:
    ```
    09:00:30-09:05:30
    09:01:30-09:06:30
    09:02:30-09:07:30 ...
    ```
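
    The effect of the offset on window boundaries comes down to one line of 
arithmetic. This is a sketch under the same assumptions as the description 
above, not the expression implemented in this PR: `WindowOffset` and 
`latestStart` are hypothetical names, and all values are in seconds.
    ```scala
    // Sketch only: plain Scala, no Spark. All values are in seconds.
    object WindowOffset {
      /** Start of the latest window beginning at or before `ts`, when windows
        * begin every `slideSec` seconds, shifted by `startSec` from the epoch. */
      def latestStart(ts: Long, slideSec: Long, startSec: Long): Long =
        ts - Math.floorMod(ts - startSec, slideSec)

      def main(args: Array[String]): Unit = {
        val ts = 9 * 3600L + 4 * 60 + 20      // 09:04:20 on 1970-01-01 UTC
        println(latestStart(ts, 60L, 0L))     // 32640, i.e. 09:04:00
        println(latestStart(ts, 60L, 30L))    // 32610, i.e. 09:03:30
      }
    }
    ```
    Subtracting the offset before taking the remainder moves every boundary by 
the same amount, which is why the windows listed above begin at :30 rather 
than :00.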
    
    Python support will be added in a follow-up PR.
    
    ## How was this patch tested?
    
    This patch adds basic unit tests for the `TimeWindow` expression that 
check parameter validation, along with unit/integration tests that check the 
correctness of the windowing and its usability in complex operations 
(multi-column grouping, multi-column projections, joins).
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark df-time-window

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12008
    
----
commit 168568f9e231827dd1a00eaf2f470592a5aa7ded
Author: Burak Yavuz <[email protected]>
Date:   2016-03-26T21:02:30Z

    time windowing

commit df7bce039b02de75fbc4dcff521dc21ee3f86d18
Author: Burak Yavuz <[email protected]>
Date:   2016-03-26T23:31:47Z

    finished adding time windowing

commit 03c14563786496276ecadb2951f818add4190396
Author: Burak Yavuz <[email protected]>
Date:   2016-03-27T19:56:33Z

    works

commit a41ee4c38f88990278413d7c0e61be517f7cd5cd
Author: Burak Yavuz <[email protected]>
Date:   2016-03-27T21:10:47Z

    minor clean up

commit db5046545a8761f289d516518c3e1ff91e551602
Author: Burak Yavuz <[email protected]>
Date:   2016-03-28T06:25:38Z

    tests

commit 9e7febb7dcaf9c0cdd312fe6978f9137f1dbaa64
Author: Burak Yavuz <[email protected]>
Date:   2016-03-28T18:52:19Z

    finished

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
