Re: Aggregates: anything planned for 1.0?

Russell Keith-Magee Thu, 12 Apr 2007 20:11:07 -0700

On 4/13/07, Honza Král <[EMAIL PROTECTED]> wrote:
> the queryset refactoring must (well, should) happen before aggregation
> support. After the refactoring, it should be relatively easy to put in
> the aggregations.
>
> see:
> http://code.djangoproject.com/ticket/3566


Ok; I've had a chance to look at this now. Although there are some
syntax differences, I think your idea and mine [1] are acutally pretty
close. A bit of both, and we might just have something.

Comments:

GROUP BY = values()
~~~~~~~~~~~~~~~~~~

Isn't the group_by clause duplicating what is already provided by the
values() filter? i.e., saying 'return me a list of dictionaries of
object data that isn't a full object'? When the aggregates clause is
in use, a GROUP BY over the selected fields is implied. If you don't
provide a values clause, then you want to GROUP BY all the fields in
the model

So, your example would become:

>>> queryset = Model.objects.all()
>>> queryset.values('name', 'city').aggregates(sum=('pay',
'some_other_field'), avg=('pay', 'age'), count=True)

Output format
~~~~~~~~~~~~~
As my proposal established, I'm a fan of the 'augmenting the returned
data' approach:

[
 { 'name' : 'John Doe',
   'city' : 'Prague',
   'sum' :' {
      'pay' : 50000,
      'some_other_field' : 10000
   }
   'average' : {
      'pay' : 10000,
      'age' : 30
   },
   'count' : 5,
 },
 ...
]

To me, the clause is returning a list of data groups (i.e., the
results of the values() clause); putting the grouping data in a
group_by dictionary seems overkill. This approach does introduce some
new magic-words that are not allowed as field names (or are at least
dangerous if you do). Personally, I don't see that as a major problem.

Weaknesses in the function=tuple argument approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While the function=tuple of field names approach does work, I feel it
is lacking:
- You can't use aliases: SUM(field) as "total_pay"
- Filtering on aggregated values is difficult (impossible?)
- It doesn't handle joins in the same way as filter()

The approach I proposed was to mirror the filter syntax:
pay__sum='total_pay'
other_field__average='some_alias'

This approach mirrors the filter mechanism for attribute naming and
joins, allows for aliases, which provides a mechanism to filter on
aliases.

The only downside I can see is that the simple case requires the user
to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this
isn't particularly elegant.

However, it should be possible to combine both approaches:

>>> Model.objects.aggregate(sum=('pay'), other_field__average='mean_value')
[
 { 'name' : 'John Doe',
   'city' : 'Prague',
   'sum' :' {
      'pay' : 50000,
   }
   'mean_value' :30,
 },
 ...
]

This means the sum of pay isn't aliased, so it can't be filtered, but
you can filter on 'mean_value'. Argument clashes aren't a problem
either - 'sum' by itself is unambiguous as an argument; any
filter-like argument will need to have at least a '__' in it.

Ambiguities in COUNT
~~~~~~~~~~~~~~~~~~~~

aggregate(count=True) is already covered with count(); I'm not sure
I'm wild about the duplication. There is also the problem that it
precludes counting other things: e.g., counting related objects -
return the number of books that each author has written.

Author.objects.aggregate(books_count='number_of_books_written')
or
Author.objects.aggregate(count=('books'))

On top of this, you can get the gross count from:

len(Author.objects.aggregate(...))

Aggregate of results vs aggregates of related objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you run an aggregate over a field from the table you are querying,
the aggregate is a table level property, not a row level property.
Compare the expected results from:

- Get me the date of the earliest and latest published article: Return
a single value for the min, and a single value for the max.
- Annotate each author with the date of their earliest and latest
published article: Return a min and max value for each author.

The second case is relatively easy:

>>> Author.objects.aggregate(article__pub_date__min='earliest',
article__pub_date__max='latest'))
[{
    'name':'John Doe',
    'earliest': 2007-01-04
    'latest': 2007-02-14
}
...
]

However, the first case:

>>> Article.objects.aggregate(pub_date__min='earliest', pub_date__max='latest'))
{[
 'title': 'first article'
 'earliest': 2007-01-04
 'latest': 2007-02-14
],
[
 'title': 'second article'
 'earliest': 2007-01-04
 'latest': 2007-02-14
],
}

while legal, isn't the most elegant way at getting at a unique min/max
for an entire table. This was the reason for my suggestion for two
functions: aggregate and annotate. Aggregate returns a dictionary of
just aggregated values:

>>> Article.objects.aggregate(min=('pub_date'), max=('pub_date'))
{
   'min' : '2007-01-04',
   'max' : '2007-02-14'
}

Whereas annotate() returns a list of annotated dictionary data.

================================

I'd better stop now - this is getting longer than I originally
anticipated, and if it gets much longer nobody will read it :-) As
always, feedback welcome.

Yours,
Russ Magee %-)

[1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to [EMAIL PROTECTED]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Aggregates: anything planned for 1.0?

Reply via email to