I was asked by @willholley to contribute this Markdown which describes a 
possible `aggregator` syntax for supporting aggregations in Mango. 

--------

# What would Mango aggregation look like?


At the moment, the Mango query language only performs data selection - a 
portion of a larger data set can be returned by providing a JSON query. If my 
data looks like this:

```js
{
  _id: "someid",
  date: "2018-08-24",
  status: "provisional",
  invoiceAddress: {
    street: "10 Front Street",
    city: "Dallas",
    state: "Texas"
  },
  amount: 7.99,
  tax: 0,
  totalAmount: 7.99,
  customerId: "A65522",
  lineItems: [
    {
      productId: "P1",
      name: "fish",
      cost: 6.0
    },
     {
      productId: "P2",
      name: "chips",
      cost: 1.99
    }
  ]
}
```  

and I want only the "complete" orders from a database I could perform a query.

```js
{
  selector: {
     status : "complete"
  },
  fields: ["totalAmount", "date"]
}
```

which would give a paged result set in blocks of 200 records:

```js
{
        "docs": [
          {
            "totalAmount": 7.99,
            "date": "2018-08-24"
          },
          {
            "totalAmount": 4.50,
            "date": "2018-08-24"
          },
          ...
        ],
        "bookmark": 
"g1AAAAA6eJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYqzVqUmFSWCJDlgkgjhLADXERDn",
        "execution_stats": {
                "total_keys_examined": 0,
                "total_docs_examined": 10,
                "total_quorum_docs_examined": 0,
                "results_returned": 8,
                "execution_time_ms": 2.75
        },
        "warning": "no matching index found, create an index to optimize query 
time"
}
```

But what I really want is a grand total of the `totalAmount` fields, grouped by 
`date` but Mango only supports *selection* not *aggregation*. I would have to 
page through all the results and perform the aggregation in my code.

This document imagines that Mango magically *does* support aggregation. 
Everything that follows is fictional syntax in my imaginary world.

--- start of imaginary world ---

## My first Mango aggregation

I can use the new "aggregator" object in a Mango query. My query still has a 
selector, because I don't want to aggregate ALL the documents, only the 
"complete" orders as before:

```js
{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"]
  }
}
```

Note:

- I no longer need to supply a `fields` object at the top level. Mango knows 
the fields it needs from those specified in the aggregator object.
- the `aggregator.operation` field can be one of `sum`, `count` or `stats`.
- `aggregator.of` is an array - we'll see how this works later.

If I'm using the `count` aggregation, I needn't supplied an `aggregation.of` 
array:

```js
{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "count"
  }
}
```

## A grand total example

Using an `aggreator.operation` of `sum` allows me to get a grand total of a 
field:

```
{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"]
  }
}
```

or fields:

```
{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"]
  }
}
```

or by emptying the "selector", all documents are aggregated:

```
{
  selector: { },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"]
  }
}
```

## Grouping

In this example I am performing a more complex selector and introduce grouping:

```
{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date"]
  }
}
```

Any valid Mango selector is allowed in the "selector" object. The "selector" is 
evaluated at index time and decides which portion of the data makes it to the 
index. If the selector is changed, a new index is required to calculate the 
result.

The optional "aggregator.group" is an array of keys by which the sum is grouped 
in the result set. It is an array, so I can have multi-dimensional grouping:


```
{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date", "invoiceAddress.state"]
  }
}
```

The above example groups by "invoiceAddress.state" which demonstrates selecting 
data from a sub-object.


## How this works

CouchDB cannot perform aggregations without an index, but instead of insisting 
that the user perform an additional "index" step, CouchDB will create the 
appropriate index (if it doesn't already exist) and return the results once 
indexing is complete.

** this is a big leap - but, hey, this is an imaginary world **

e.g. for an aggregation query like this:

```
{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"],
    group: ["date"]
  }
}
```

we need an index that looks like this in old-school JavaScript MapReduce

```js
map: 
        function(doc) {
          if (doc.status === 'complete') {
            emit([doc.date], [doc.amount])
          }
        }
reduce:
   "_sum"
```

The "selector" in an aggregation is really a `partial_filter_selector` so that 
the index only contains a sub-set of the documents.

## Limit the result set with range

The trouble with this approach is that the aggregation query always returns all 
of the aggregations e.g. if one sum per day for every day that you have data. 
You may only want this month's per-day aggregates. This is where 
`aggregator.range` comes in:

```
{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date", "invoiceAddress.state"],
    range: {
      start: ["2018-08-01"],
      end: ["2018-09-01"]
    }
  }
}
```

The `aggregator.range` object can be used at query time to return a smaller 
proportion of the aggregated result, whether grouping or not.

---- end of imaginary world ---

[ Full content available at: https://github.com/apache/couchdb/issues/1254 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to