Re: Annotation failure (Combining multiple aggregations)

Anssi Kääriäinen Mon, 07 Dec 2015 23:20:08 -0800

On Friday, December 4, 2015 at 5:47:43 PM UTC+2, Paulo Maciel wrote:
>
> Combining multiple aggregations with annotate() will yield the wrong 
> results <https://code.djangoproject.com/ticket/10060>, as multiple tables 
> are cross joined. Due to the use of LEFT OUTER JOIN, duplicate records 
> will be generated if some of the joined tables contain more records than 
> the others
>


This has nothing to do with cross joins, usage of LEFT JOINS or having more 
records in one table than the others. The reason is that we should use 
subqueries where we currently use JOINs.

As a complex example, (one) correct SQL query for 
Book.objects.annotate(chapter_cnt=Count('chapter'), 
author_cnt=Count('author')) is:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter GROUP BY book_id) 
as subq1 ON subq1.book_id = book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author GROUP BY book_id) 
as subq2 ON subq2.book_id = book.book_id
   GROUP BY book.book_id

This is still an easy case for the ORM. To add a bit more complexity, 
consider 
Book.objects.filter(Q(chapter__pages__lt=100)|Q(author__age__gt=50)).annotate(chapter_cnt=Count('chapter'),
 
author_cnt=Count('author')). I *think* we need a query like:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter WHERE 
chapter.pages < 100 GROUP BY book_id) as subq1 ON subq1.book_id = 
book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author WHERE author.age > 
50 GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
  WHERE subq1.book_id IS NOT NULL OR subq2.book_id IS NOT NULL
   GROUP BY book.book_id

There are even more complex cases. For example you might need a subquery 
inside the generated subqueries in some cases.

I have some ideas of how to fix this (though in cases like this it is hard 
to know if the ideas actually work before trying). I believe fixing this 
takes at least a couple of months of development time. At least my 
development budget seems to be around one day a month currently.

An alternate approach is to error out when the ORM detects a complex enough 
query, and add explicit ways for users to add subqueries, subselects and 
stuff like that to ORM queries. This might make more sense than trying to 
generate complex queries automatically, as there are multiple ways to write 
the above query, and the different forms perform vastly differently 
depending on the data in the tables and the used database backend. Still, 
we can have it both ways - automatically generate a correct query, but also 
offer a way to explicitly add subqueries to ORM queries.

 - Anssi

>
>
> The Count 
> <https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.Count>
>  aggregate 
> has a distinct parameter that may help:
>
> q = Book.objects.annotate(Count('authors', distinct=True), Count('chapters', 
> distinct=True))
>
> Why not "distinct=True" to use in Sum?
>
> q = Book.objects.annotate(Sum('val_a', distinct=True), Sum('val_b', 
> distinct=True))
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/e32741dc-097e-4fba-b6ed-e831dc545c9a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Annotation failure (Combining multiple aggregations)

Reply via email to