[GitHub] [incubator-druid] yurmix opened issue #6320: Moving Average query type

GitHub Mon, 10 Sep 2018 02:38:34 -0700

# [Feature] Moving Average query type
## A groupBy-wrapping query type, optimized for performing moving-average calculations.
Note: The concept of a moving average is also known as a rolling average or running 
average.


## Background:
In general terms, a Moving Average is a calculation performed on top of a time 
series, in order to smooth out fluctuations. 

Below is an example of the smoothing effect of a moving average function:
![An example of two moving average 
curves](https://upload.wikimedia.org/wikipedia/commons/d/d9/MovingAverage.GIF 
"An example of two moving average curves")

A simple example would be a trailing seven-day average for page views. The 
value for each day in the result would be the average of the page views over 
that day and the six days before it:

![](https://latex.codecogs.com/gif.latex?trailingPageViews=\frac{\sum_{i=day-6}^{day}pageViews_i}{7})
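
As a minimal sketch of the formula above, the Python snippet below computes the trailing seven-day average for a single day. The function name `trailing_average` and the `daily_views` numbers are made up for illustration and are not part of the proposal.

```python
def trailing_average(page_views, day, window=7):
    """Mean of page_views[day - window + 1 .. day], both ends inclusive."""
    if day < window - 1:
        raise ValueError("not enough history for a full window")
    return sum(page_views[day - window + 1 : day + 1]) / window


# Hypothetical daily page-view counts, indexed by day number.
daily_views = [100, 120, 90, 110, 130, 95, 105, 140]

print(trailing_average(daily_views, 6))  # average over days 0..6
print(trailing_average(daily_views, 7))  # average over days 1..7
```

This sketch refuses to answer when fewer than seven days of history exist; other conventions for the first few days are possible.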

Additional theoretical background can be found on the Wikipedia page titled 
[Moving Average](https://en.wikipedia.org/wiki/Moving_average).

## Problem:
Currently, in order to compute a moving average with Druid, one would need to 
union multiple Timeseries/GroupBy queries, one per day of the result.
In addition to being cumbersome, that approach is also less efficient, as it 
requires multiple passes over each row.
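
To make the cost of that workaround concrete, here is a rough sketch that lists the intervals such per-day queries would cover for a seven-day window. The helper `naive_query_intervals` and the September 2018 dates are hypothetical; the point is only that each day of data ends up inside several overlapping intervals.

```python
from datetime import date, timedelta


def naive_query_intervals(first_day, last_day, window=7):
    """One (start, end) interval per result day; end is exclusive."""
    intervals = []
    day = first_day
    while day <= last_day:
        start = day - timedelta(days=window - 1)
        intervals.append((start, day + timedelta(days=1)))
        day += timedelta(days=1)
    return intervals


intervals = naive_query_intervals(date(2018, 9, 1), date(2018, 9, 30))

# Count how many of the per-day queries have to scan the data of one given day.
probe = date(2018, 9, 10)
covered = sum(1 for start, end in intervals if start <= probe < end)
print(len(intervals), "queries;", probe, "is scanned by", covered, "of them")
```

Under these assumptions a 30-day result needs 30 queries, and a typical day of data is read once per window position that contains it.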

## Solution:
We propose a new query type, **movingAverage**, which wraps a [groupBy 
query](http://druid.io/docs/latest/querying/groupbyquery.html) (or a [timeseries 
query](http://druid.io/docs/latest/querying/timeseriesquery.html) when there 
are no dimensions).

At a high level, the movingAverage query does the following:
1. Run an inner query (of type groupBy or timeseries) to get initial daily 
aggregations.
2. Compute the moving-average function based on the inner query results.
3. Return combined records with both the simple aggregation and the moving 
average.

This allows the query to avoid multiple segment passes and aggregations per 
granularity period.
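
The three steps can be sketched in plain Python as follows. This is an illustration of the idea, not the proposed implementation: the in-memory `rows`, the `period` key, and the function name `moving_average` are all assumptions, and periods without data are simply skipped.

```python
from collections import defaultdict, deque


def moving_average(rows, metric, buckets):
    # Step 1: inner aggregation (stand-in for groupBy/timeseries):
    # sum the metric once per period.
    totals = defaultdict(int)
    for row in rows:
        totals[row["period"]] += row[metric]

    # Steps 2 and 3: slide a window over the per-period totals and emit
    # the plain aggregation together with the moving average.
    window = deque()
    running_sum = 0
    results = []
    for period in sorted(totals):
        value = totals[period]
        window.append(value)
        running_sum += value
        if len(window) > buckets:
            running_sum -= window.popleft()
        # Assumption: divide by the configured bucket count, so periods
        # before the start of the data count as zero.
        results.append({"period": period, metric: value,
                        "movingAverage": running_sum / buckets})
    return results


rows = [
    {"period": 0, "views": 4},
    {"period": 0, "views": 6},
    {"period": 1, "views": 20},
    {"period": 2, "views": 30},
]
for record in moving_average(rows, "views", buckets=2):
    print(record)
```

Each raw row is read exactly once in step 1, and the window update in step 2 is constant work per period, which is the saving the wrapping query aims for.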

In order to allow a flexible definition of the moving average function, the 
movingAverage query introduces a new interface called **Averager**. The 
Averager is somewhat similar to the **Aggregator**, but while the Aggregator's 
input is a metric from the datasource, the Averager's input is an Aggregator 
from the query.
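
One possible shape for such an interface, written in Python purely as a sketch: the class and method names below (`Averager`, `add`, `compute`, `MeanAverager`, `MaxAverager`) are hypothetical and do not describe the actual Java interface in the pull request.

```python
from abc import ABC, abstractmethod
from collections import deque


class Averager(ABC):
    """Hypothetical averager: consumes one aggregator value per period."""

    def __init__(self, name, field_name, buckets):
        self.name = name                # output name
        self.field_name = field_name    # name of the aggregator to read
        self.window = deque(maxlen=buckets)

    def add(self, aggregated_row):
        # The input is an already-aggregated row, not a raw datasource row.
        self.window.append(aggregated_row[self.field_name])

    @abstractmethod
    def compute(self):
        """Return the averager's value for the current window."""


class MeanAverager(Averager):
    def compute(self):
        # Assumption: divide by the configured bucket count, even while the
        # window is still filling up.
        return sum(self.window) / self.window.maxlen


class MaxAverager(Averager):
    def compute(self):
        return max(self.window)


mean = MeanAverager("trailingMean", "pageViews", buckets=3)
peak = MaxAverager("trailingMax", "pageViews", buckets=3)
for row in [{"pageViews": 10}, {"pageViews": 40},
            {"pageViews": 10}, {"pageViews": 70}]:
    mean.add(row)
    peak.add(row)
    print(mean.compute(), peak.compute())
```

Keeping the window handling in the base class means a new formula only has to supply `compute`, which is one way to get the flexibility described above.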

## Example:
This example is based on the `wikipedia` datasource available via the [tutorial 
examples 
package](http://druid.io/docs/latest/tutorials/tutorial-examples.tar.gz).

The supplied datasource has only a single day's worth of data, so we will use 
30-minute periods instead of the usual daily period.
_Note: I have chosen a granularity period of 30 minutes in order to have enough 
data points. In reality, the current implementation doesn't fully support 
sub-daily granularity, but it should require only minor changes to accommodate 
such an enhancement._

Let's use the `delta` metric in the `wikipedia` datasource.
Say we want to compute the 7-period mean of `delta` over 30-minute periods.
We will define both an aggregator and an averager for this task using the 
movingAverage query syntax:

```json
{
  "queryType": "movingAverage",
  "dataSource": "wikipedia",
  "granularity": {
    "type": "period",
    "period": "PT30M"
  },
  "intervals": [
    "2015-09-12T00:00:00Z/2015-09-13T00:00:00Z"
  ],
  "aggregations": [
    {
      "name": "delta30Min",
      "fieldName": "delta",
      "type": "longSum"
    }
  ],
  "averagers": [
    {
      "name": "trailing30MinChanges",
      "fieldName": "delta30Min",
      "type": "longMean",
      "buckets": 7
    }
  ]
}
```
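
For readers who prefer to generate such a query from code, the sketch below builds the same JSON structure in Python and wraps it in an HTTP POST request. The helper names and the generated output names are made up; the broker URL `http://localhost:8082/druid/v2/` is an assumption about a local default setup, and the query would only be accepted by a Druid build that includes the proposed query type.

```python
import json
import urllib.request


def build_moving_average_query(datasource, metric, period, interval, buckets):
    agg_name = f"{metric}Sum"  # hypothetical naming scheme
    return {
        "queryType": "movingAverage",
        "dataSource": datasource,
        "granularity": {"type": "period", "period": period},
        "intervals": [interval],
        "aggregations": [
            {"name": agg_name, "fieldName": metric, "type": "longSum"}
        ],
        "averagers": [
            # The averager reads the aggregator's output, not the raw metric.
            {"name": f"{metric}TrailingMean", "fieldName": agg_name,
             "type": "longMean", "buckets": buckets}
        ],
    }


def make_request(query, broker_url="http://localhost:8082/druid/v2/"):
    # Builds the request object only; nothing is sent until urlopen is called.
    return urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = build_moving_average_query(
    "wikipedia", "delta", "PT30M",
    "2015-09-12T00:00:00Z/2015-09-13T00:00:00Z", buckets=7,
)
request = make_request(query)
print(json.dumps(query, indent=2))
```

Passing `request` to `urllib.request.urlopen` would submit it, assuming a broker is reachable at that address.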

Note that this syntax is derived from groupBy, with the addition of the 
**averagers** JSON array. Each averager object has the following fields:

**name**: Output name.
**fieldName**: Input (aggregator) name.
**type**: Formula type (longMean/doubleMean/doubleMax/etc. Full list will be 
included in the documentation).
**buckets**: Number of buckets to look back.

The result format is inherited from groupBy:
```json
[ {
  "version" : "v1",
  "timestamp" : "2015-09-12T00:30:00.000Z",
  "event" : {
    "delta30Min" : 30490,
    "trailing30MinChanges" : 4355.714285714285
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-09-12T01:00:00.000Z",
  "event" : {
    "delta30Min" : 96526,
    "trailing30MinChanges" : 18145.14285714286
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-09-12T01:30:00.000Z",
  "event" : {
    "delta30Min" : 87887,
    "trailing30MinChanges" : 30700.428571428572
  }
}, {
  "version" : "v1",
  "timestamp" : "2015-09-12T02:00:00.000Z",
  "event" : {
    "delta30Min" : 254632,
    "trailing30MinChanges" : 67076.42857142857
  }
} ]
```
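
The sample values above can be checked by hand. The sketch below recomputes `trailing30MinChanges` from the four `delta30Min` values and compares against the reported numbers. The sample is consistent with dividing the window sum by the full bucket count (7) even when fewer periods are available; that is an observation about these four rows, not a confirmed statement about how the implementation treats partial windows.

```python
import math

# Values copied from the sample result above.
delta_30min = [30490, 96526, 87887, 254632]
reported = [4355.714285714285, 18145.14285714286,
            30700.428571428572, 67076.42857142857]
BUCKETS = 7


def trailing_mean(values, index, buckets):
    """Sum of up to `buckets` values ending at `index`, divided by `buckets`."""
    start = max(0, index - buckets + 1)
    return sum(values[start : index + 1]) / buckets


recomputed = [trailing_mean(delta_30min, i, BUCKETS)
              for i in range(len(delta_30min))]

for got, want in zip(recomputed, reported):
    # Compare with a tolerance to sidestep float formatting differences.
    assert math.isclose(got, want, rel_tol=1e-9)
print("recomputed values match the sample:", recomputed)
```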

A graph of the result will show a smoothing effect:
![](https://docs.google.com/spreadsheets/d/e/2PACX-1vRBMWvSs2IFuEHfCqWWdLgXQng512RnyenX9aeST7mG-haqcAZrXDCm_m2HT25adbKf6Op-e33npJTm/pubchart?oid=685921408&format=image)

There are a few more advanced aspects to the implementation and usage of 
movingAverage. Those will be available in the pull request (via the code and the 
documentation).

[ Full content available at: 
https://github.com/apache/incubator-druid/issues/6320 ]
