On Tuesday, December 8, 2015 at 9:19:52 AM UTC+2, Anssi Kääriäinen wrote:
>
> On Friday, December 4, 2015 at 5:47:43 PM UTC+2, Paulo Maciel wrote:
>>
>> Combining multiple aggregations with annotate() will yield the wrong
>> results <https://code.djangoproject.com/ticket/10060>, as multiple
>> tables are cross joined. Due to the use of LEFT OUTER JOIN, duplicate
>> records will be generated if some of the joined tables contain more records
>> than the others
>>
>
> This has nothing to do with cross joins, usage of LEFT JOINS or having
> more records in one table than the others. The reason is that we should use
> subqueries where we currently use JOINs.
>
> As a complex example, (one) correct SQL query for
> Book.objects.annotate(chapter_cnt=Count('chapter'),
> author_cnt=Count('author')) is:
>
> SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
> FROM book
> LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter GROUP BY
> book_id) as subq1 ON subq1.book_id = book.book_id
> LEFT JOIN (SELECT count(*) as author_cnt FROM author GROUP BY book_id)
> as subq2 ON subq2.book_id = book.book_id
> GROUP BY book.book_id
>
> This is still an easy case for the ORM. To add a bit more complexity,
> consider
> Book.objects.filter(Q(chapter__pages__lt=100)|Q(author__age__gt=50)).annotate(chapter_cnt=Count('chapter'),
>
> author_cnt=Count('author')). I *think* we need a query like:
>
> SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
> FROM book
> LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter WHERE
> chapter.pages < 100 GROUP BY book_id) as subq1 ON subq1.book_id =
> book.book_id
> LEFT JOIN (SELECT count(*) as author_cnt FROM author WHERE author.age >
> 50 GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
> WHERE subq1.book_id IS NOT NULL OR subq2.book_id IS NOT NULL
> GROUP BY book.book_id
>
> There are even more complex cases. For example you might need a subquery
> inside the generated subqueries in some cases.
>
> I have some ideas of how to fix this (though in cases like this it is hard
> to know if the ideas actually work before trying). I believe fixing this
> takes at least a couple of months of development time. At least my
> development budget seems to be around one day a month currently.
>
> An alternate approach is to error out when the ORM detects a complex
> enough query, and add explicit ways for users to add subqueries, subselects
> and stuff like that to ORM queries. This might make more sense than trying
> to generate complex queries automatically, as there are multiple ways to
> write the above query, and the different forms perform vastly differently
> depending on the data in the tables and the used database backend. Still,
> we can have it both ways - automatically generate a correct query, but also
> offer a way to explicitly add subqueries to ORM queries.
>
I started writing some notes about the changes needed to handle
automatically any .filter().annotate() combo. The notes are of very bad
quality, but I'll publish them here even if that is the case. The point is
that there is too much work, and the end result would be something where
users don't understand why they get the queries they are getting. Still,
the code would be complex enough to be extremely fragile.
I strongly believe the approach must be that we offer users explicit ways
to modify the query (ability to join subqueries will get us a long way
here), and that Django simply errors out when it sees query constructs it
can't handle. We can later on maybe automate *some* of the error cases, but
doing it for all cases seems extremely hard.
The notes assume we implement filter() for multivalued joins by doing
subqueries instead of using joins. This is in practice needed for the
approach to work at all.
When we do qs = Organization.objects.filter(subunits__size__gte=20)
- add a new subquery condition to the query's where. Note that we added a
subquery
condition for "subunits".
- query at this point is:
select * from ... where pk in (select parent_unit_id from
organization where size >= 20)
Next, when we do qs = qs.annotate(sum=Sum('subunits__size'))
- We find the existing where condition (with a subquery) for "subunits"
- Remove the subquery condition from the where condition, decrease the
refcount for leading joins to that subquery by 1
- Add the sum to the subquery
- Add the subquery as join to the query (*now* we can do the filter as a
join. The
group by in the subquery enforces at maximum a single row per row in
the outer query.).
The join type should be INNER, except in cases where the subquery
condition was ORed
in the outer query.
- Add subq.sum to the outer query's annotations.
- Add "subq.select[0] IS NOT NULL" to the outer query's where (in place
of the removed subquery
contidion)
- Mark that subunits has now a joined subquery
- Note that we must use the same GROUP BY in the joined subquery than we
do for the outer query.
- So, for example
Organization.values('type').filter(subunits__size__gte=20).annotate(avg_size=Avg('subunits__size'))
-> Meaning: for each type in Organization, calculcate the average
size of subunits, but only for subunits which are
of size 20 or bigger.
-> Query:
SELECT organization.type, subq.avg_size
FROM organization
INNER JOIN (SELECT U0.type, AVG(U1.size) as avg_age
FROM organization U0
INNER JOIN organization U1
WHERE U1.size >= 20
GROUP BY U0.name) subq ON organization.type =
subq.type;
Even if we were to do the above, we still wouldn't handle all legal Django
queries correctly.
- Anssi
--
You received this message because you are subscribed to the Google Groups
"Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit
https://groups.google.com/d/msgid/django-developers/a81441fc-36ed-4535-a7ed-3572999a1ee9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.