Re: [Django] #34597: Queryset (split) exclude's usage of Exists is significantly slower than subquery (was: Django ORM query SQL generation changed (and is apparently a lot slower))

Django Sun, 28 May 2023 20:30:18 -0700

#34597: Queryset (split) exclude's usage of Exists is significantly slower than
subquery
-------------------------------------+-------------------------------------
     Reporter:  Lorand Varga         |                    Owner:  nobody
         Type:                       |                   Status:  new
  Cleanup/optimization               |
    Component:  Database layer       |                  Version:  3.2
  (models, ORM)                      |
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:
                                     |  Unreviewed
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------


Comment (by Simon Charette):

 Thank you for providing query plans, this is helpful.

 While it's undeniable that in this particular case the query is slower,
 [https://wiki.postgresql.org/wiki/Don't_Do_This#Don.27t_use_NOT_IN
 Postgres wiki itself advise against using NOT IN] so there is certainly
 something else to blame here.

 The first thing that is suspicious to me is that the new query is using
 parallel workers which are know to be causing some slowdown when not
 appropriately configured. For example, if running the first query gets you
 close to your `work_mem` then the second one will definitely will cause
 spills to disk and could explain the significant difference in execution
 duration. What I see from the plan is that Postgres believes that it will
 be faster to parallelize the query but when it tries to materialize the
 results returned by workers [https://explain.depesz.com/s/KPXv#stats it
 takes ages].

 Tuning this parameter is [https://www.percona.com/blog/parallel-queries-
 in-postgresql/ definitely out of scope for this ticket] as it dips into
 user / DBA support territory but it'd be great to confirm that it's the
 reason you are encountering this issue.

 > I've managed to use your advice and modify the query to force the
 subquery to use IN but that leads to synthax that is more complex, while
 the django 2.2 synthax is a lot simpler: ~Q(translation=None)
 >
 > While not terrible I do think something has been broken in django since
 the ~Q(translation=None) seems a lot cleaner and more ORMish.

 I've never argue otherwise, it was solely meant as a way to unblock you
 from upgrading from a Django version that has been unsupported for months
 and that I assumed you were eager to move away from.

 As for what ''ORMish'' means that is open for interpretation. I would
 argue that `~Q(translation=None)` is extremely ambiguous given how `None`
 doesn't map 1:1 to SQL NULL semantics that the objective of an ORM is to
 abstract SQL operations to do the right thing most of the time while
 providing escape hatches for the remaining cases. In this particular case
 I've provided you a reasoning about why these changes were made so
 hopefully it's clear to you why we need substantial evidence before
 reverting these changes given they have already been part of two LTS
 without prior reports of performance degradation.

 In order to determine what might the origin your issue I'd ask you to run
 the following queries while preceding them with a `SET
 max_parallel_workers_per_gather = 0`

 {{{#!sql
 EXPLAIN ANALYZE
 SELECT
   "test_app_blog"."id",
   "test_app_blog"."created_at",
   "test_app_blog"."updated_at",
   "test_app_blog"."is_published",
   "test_app_blog"."api_has_translation"
 FROM
   "test_app_blog"
 WHERE
   (
     "test_app_blog"."is_published"
     AND NOT "test_app_blog"."api_has_translation"
     AND NOT (
       EXISTS(
         SELECT
           (1) AS "a"
         FROM
           "test_app_blog" U0
           LEFT OUTER JOIN "test_app_translation" U1 ON (U0."id" =
 U1."blog_id")
         WHERE
           (
             U1."id" IS NULL
             AND U0."id" = "test_app_blog"."id"
           )
         LIMIT
           1
       )
     )
   )
 ORDER BY
   "test_app_blog"."updated_at" DESC
 }}}

 {{{#!sql
 EXPLAIN ANALYZE
 SELECT
   "test_app_blog"."id",
   "test_app_blog"."created_at",
   "test_app_blog"."updated_at",
   "test_app_blog"."is_published",
   "test_app_blog"."api_has_translation"
 FROM
   "test_app_blog"
 WHERE
   (
     "test_app_blog"."is_published"
     AND NOT "test_app_blog"."api_has_translation"
     AND EXISTS(
         SELECT
           (1) AS "a"
         FROM
           "test_app_translation" U1
         WHERE
           U1."blog_id" = "test_app_blog"."id"
         LIMIT
           1
       )
     )
   )
 ORDER BY
   "test_app_blog"."updated_at" DESC
 }}}

-- 
Ticket URL: <https://code.djangoproject.com/ticket/34597#comment:3>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/01070188658e66fd-6cbfbca3-9954-4c4c-909a-a6398a83e96d-000000%40eu-central-1.amazonses.com.

Re: [Django] #34597: Queryset (split) exclude's usage of Exists is significantly slower than subquery (was: Django ORM query SQL generation changed (and is apparently a lot slower))

Reply via email to