#32152: Groupby after aggregation and subquery produces subtly wrong results
-------------------------------------+-------------------------------------
     Reporter:  Christian Klus       |                    Owner:  nobody
         Type:  Bug                  |                   Status:  new
    Component:  Database layer       |                  Version:  3.1
  (models, ORM)                      |
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:
                                     |  Unreviewed
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Description changed by Christian Klus:

Old description:

> Starting in django 3.0.7, specifically after patch #31566 I noticed some
> of my more complex queries returning incorrect results. I think I've
> narrowed it down to a simpler test case:
>
> Example query:
> {{{
> Book.objects.all().annotate(
>     pub_year=TruncYear('pubdate')
> ).order_by().values('pub_year').annotate(
>     total_pages=Sum('pages'),
>     top_rating=Subquery(
>         Book.objects.filter(
>             pubdate__year=OuterRef('pub_year')
>         ).order_by('rating').values('rating')[:1]
>     )
> ).values('pub_year', 'total_pages', 'top_rating')
> }}}
>
> Generated SQL on 3.0.6:
>
> {{{
> SELECT
>   django_date_trunc('year', "aggregation_regress_book"."pubdate") AS
> "pub_year",
>   SUM("aggregation_regress_book"."pages") AS "total_pages",
>   (
>     SELECT U0."rating"
>     FROM "aggregation_regress_book" U0
>     WHERE
>       django_date_extract('year', U0."pubdate") =
> django_date_trunc('year', "aggregation_regress_book"."pubdate")
>     ORDER BY U0."rating" ASC LIMIT 1
>   ) AS "top_rating"
> FROM "aggregation_regress_book"
> GROUP BY
>   django_date_trunc('year', "aggregation_regress_book"."pubdate"),
>   "top_rating"
> }}}
>
> Generated SQL on current master:
>
> {{{
> SELECT
>   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL) AS "pub_year",
>   SUM("aggregation_regress_book"."pages") AS "total_pages",
>   (
>     SELECT U0."rating"
>     FROM "aggregation_regress_book" U0
>     WHERE
>       django_date_extract('year', U0."pubdate") =
> django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL)
>     ORDER BY U0."rating" ASC LIMIT 1
>   ) AS "top_rating"
> FROM "aggregation_regress_book"
> GROUP BY
>   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL),
>   (
>     SELECT U0."rating"
>     FROM "aggregation_regress_book" U0
>     WHERE
>       django_date_extract('year', U0."pubdate") =
> django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL)
>     ORDER BY U0."rating" ASC LIMIT 1
>   ),
>   "aggregation_regress_book"."pubdate"
> }}}
>
> ----
>
> I see two separate issues here:
>   - {{{"aggregation_regress_book"."pubdate"}}} is being added to the
> group by clause incorrectly (this issue probably predates the patch
> mentioned above)
>   - Even though the subquery is in the select statement, the alias is not
> being used and instead the subquery is reevaluated. This nearly doubles
> the cost of one of my queries that is experiencing this problem.
>
> ----
>
> I don't know much about the ORM internals, but here is my naive patch for
> the second issue:
>
> {{{
> diff --git a/django/db/models/sql/query.py
> b/django/db/models/sql/query.py
> index ee98984826..6ea287d6cb 100644
> --- a/django/db/models/sql/query.py
> +++ b/django/db/models/sql/query.py
> @@ -2220,7 +2220,7 @@ class Query(BaseExpression):
>              # the selected fields anymore.
>              group_by = []
>              for expr in self.group_by:
> -                if isinstance(expr, Ref) and expr.refs not in
> field_names:
> +                if isinstance(expr, Ref) and expr.refs not in
> field_names + annotation_names:
>                      expr = self.annotations[expr.refs]
>                  group_by.append(expr)
>              self.group_by = tuple(group_by)
> }}}
>
> I'd appreciate anyone with deeper knowlege of the ORM to chime in and let
> me know if I'm on the right track. Tests are passing locally.
>
> The resulting query on master:
>
> {{{
> SELECT
>   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL) AS "pub_year",
>   SUM("aggregation_regress_book"."pages") AS "total_pages",
>   (
>     SELECT U0."rating"
>     FROM "aggregation_regress_book" U0
>     WHERE
>       django_date_extract('year', U0."pubdate") =
> django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL)
>     ORDER BY U0."rating" ASC LIMIT 1
>   ) AS "top_rating"
> FROM "aggregation_regress_book"
> GROUP BY
>   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
> NULL),
>   "top_rating"
> }}}

New description:

 Starting in django 3.0.7, specifically after patch #31566 I noticed some
 of my more complex queries returning incorrect results. I think I've
 narrowed it down to a simpler test case:

 Example query:
 {{{#!python
 Book.objects.all().annotate(
     pub_year=TruncYear('pubdate')
 ).order_by().values('pub_year').annotate(
     total_pages=Sum('pages'),
     top_rating=Subquery(
         Book.objects.filter(
             pubdate__year=OuterRef('pub_year')
         ).order_by('rating').values('rating')[:1]
     )
 ).values('pub_year', 'total_pages', 'top_rating')
 }}}

 Generated SQL on 3.0.6:

 {{{#!sql
 SELECT
   django_date_trunc('year', "aggregation_regress_book"."pubdate") AS
 "pub_year",
   SUM("aggregation_regress_book"."pages") AS "total_pages",
   (
     SELECT U0."rating"
     FROM "aggregation_regress_book" U0
     WHERE
       django_date_extract('year', U0."pubdate") =
 django_date_trunc('year', "aggregation_regress_book"."pubdate")
     ORDER BY U0."rating" ASC LIMIT 1
   ) AS "top_rating"
 FROM "aggregation_regress_book"
 GROUP BY
   django_date_trunc('year', "aggregation_regress_book"."pubdate"),
   "top_rating"
 }}}

 Generated SQL on current master:

 {{{#!sql
 SELECT
   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL) AS "pub_year",
   SUM("aggregation_regress_book"."pages") AS "total_pages",
   (
     SELECT U0."rating"
     FROM "aggregation_regress_book" U0
     WHERE
       django_date_extract('year', U0."pubdate") =
 django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL)
     ORDER BY U0."rating" ASC LIMIT 1
   ) AS "top_rating"
 FROM "aggregation_regress_book"
 GROUP BY
   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL),
   (
     SELECT U0."rating"
     FROM "aggregation_regress_book" U0
     WHERE
       django_date_extract('year', U0."pubdate") =
 django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL)
     ORDER BY U0."rating" ASC LIMIT 1
   ),
   "aggregation_regress_book"."pubdate"
 }}}

 ----

 I see two separate issues here:
   - {{{"aggregation_regress_book"."pubdate"}}} is being added to the group
 by clause incorrectly (this issue probably predates the patch mentioned
 above)
   - Even though the subquery is in the select statement, the alias is not
 being used and instead the subquery is reevaluated. This nearly doubles
 the cost of one of my queries that is experiencing this problem.

 ----

 I don't know much about the ORM internals, but here is my naive patch for
 the second issue:

 {{{
 diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
 index ee98984826..6ea287d6cb 100644
 --- a/django/db/models/sql/query.py
 +++ b/django/db/models/sql/query.py
 @@ -2220,7 +2220,7 @@ class Query(BaseExpression):
              # the selected fields anymore.
              group_by = []
              for expr in self.group_by:
 -                if isinstance(expr, Ref) and expr.refs not in
 field_names:
 +                if isinstance(expr, Ref) and expr.refs not in field_names
 + annotation_names:
                      expr = self.annotations[expr.refs]
                  group_by.append(expr)
              self.group_by = tuple(group_by)
 }}}

 I'd appreciate anyone with deeper knowlege of the ORM to chime in and let
 me know if I'm on the right track. Tests are passing locally.

 The resulting query on master:

 {{{#!sql
 SELECT
   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL) AS "pub_year",
   SUM("aggregation_regress_book"."pages") AS "total_pages",
   (
     SELECT U0."rating"
     FROM "aggregation_regress_book" U0
     WHERE
       django_date_extract('year', U0."pubdate") =
 django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL)
     ORDER BY U0."rating" ASC LIMIT 1
   ) AS "top_rating"
 FROM "aggregation_regress_book"
 GROUP BY
   django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL,
 NULL),
   "top_rating"
 }}}

--

-- 
Ticket URL: <https://code.djangoproject.com/ticket/32152#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-updates+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/065.7ddca347495bbb7b060b4155ad02b68d%40djangoproject.com.

Reply via email to