#36157: Unusual behaviour when pre-fetching with only applied on the related 
fields
-------------------------------+--------------------------------------
     Reporter:  Tim McCurrach  |                    Owner:  (none)
         Type:  Bug            |                   Status:  new
    Component:  Uncategorized  |                  Version:  5.1
     Severity:  Normal         |               Resolution:
     Keywords:                 |             Triage Stage:  Unreviewed
    Has patch:  0              |      Needs documentation:  0
  Needs tests:  0              |  Patch needs improvement:  0
Easy pickings:  0              |                    UI/UX:  0
-------------------------------+--------------------------------------
Description changed by Tim McCurrach:

Old description:

> When prefetching related models. If you apply `.only()` to the related
> queryset, django performs additional lookups for related IDs that have
> been left out of the `only`.
>
> It is probably easiest to explain the issue with an example.
>
> === Example Situation ===
>
> Suppose you have these models:
>
> {{{
> class Blog(models.Model):
>     name = models.CharField(max_length=100)
>
> class Post(models.Model):
>     name = models.CharField(max_length=100)
>     blog = models.ForeignKey(Blog, on_delete=models.CASCADE,
> related_name="posts")
>     ...lots of other big fields
> }}}
> And you create a few items for each:
> {{{
> blog = Blog.objects.create(name="Django Tricks")
> blog2 = Blog.objects.create(name="React Tricks")
> Post.objects.create(name="prefetching", blog=blog)
> Post.objects.create(name="models", blog=blog)
> Post.objects.create(name="templates", blog=blog)
> Post.objects.create(name="hooks", blog=blog2)
> Post.objects.create(name="components", blog=blog2)
> }}}
> If I wish to pre-fetch the posts for some blogs, but only want the names
> of each post, rather than the content of each post I can do the
> following:
> {{{
> Blog.objects.prefetch_related(Prefetch("posts",
> queryset=Post.objects.only("name")))
> }}}
> I would expect this to result in just 2 database queries. One to fetch
> the data for the `Blog` instances, and another to fetch the the data for
> the related `Post`s. Instead, there is an n+1 issue where there are 5
> extra follow up requests for each of the related `Post` instances. This
> is the SQL that is generated:
> {{{
> SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21
> SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE
> "app_post"."blog_id" IN (1, 2)
> SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
> "app_post"."id" = 1 LIMIT 21
> SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
> "app_post"."id" = 2 LIMIT 21
> SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
> "app_post"."id" = 3 LIMIT 21
> SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
> "app_post"."id" = 4 LIMIT 21
> SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
> "app_post"."id" = 5 LIMIT 21
> }}}
> I can understand it might be a good idea to have the related-id's for the
> blog on hand should you need them later. But I also think, that by using
> `.only()` you are explicitly telling django - I don't need these. This is
> a real problem for larger data-sets, where you end up with thousands of
> extra round-trips to the database.
>
> === Context ===
>
> This is an issue that came up in the wild. I'm using a third-party
> optimiser that improves the performance of graphQL queries by decorating
> querysets with `only()`, `select_related()` etc. It correctly identifies
> that I am only using certain fields and applies `only()` to them, knowing
> I will never need to access certain related fields. This results in
> django producing many additional hits to the database.

New description:

 When prefetching related models. If you apply `.only()` to the related
 queryset, django performs additional lookups for related IDs that have
 been left out of the `only`.

 It is probably easiest to explain the issue with an example.

 === Example Situation ===

 Suppose you have these models:

 {{{
 class Blog(models.Model):
     name = models.CharField(max_length=100)

 class Post(models.Model):
     name = models.CharField(max_length=100)
     blog = models.ForeignKey(Blog, on_delete=models.CASCADE,
 related_name="posts")
     ...lots of other big fields
 }}}
 And you create a few items for each:
 {{{
 blog = Blog.objects.create(name="Django Tricks")
 blog2 = Blog.objects.create(name="React Tricks")
 Post.objects.create(name="prefetching", blog=blog)
 Post.objects.create(name="models", blog=blog)
 Post.objects.create(name="templates", blog=blog)
 Post.objects.create(name="hooks", blog=blog2)
 Post.objects.create(name="components", blog=blog2)
 }}}
 If I wish to pre-fetch the posts for some blogs, but only want the names
 of each post, rather than the content of each post I can do the following:
 {{{
 Blog.objects.prefetch_related(Prefetch("posts",
 queryset=Post.objects.only("name")))
 }}}
 I would expect this to result in just 2 database queries. One to fetch the
 data for the `Blog` instances, and another to fetch the the data for the
 related `Post`s. Instead, there is an n+1 issue where there are 5 extra
 follow up requests for each of the related `Post` instances. This is the
 SQL that is generated:
 {{{
 SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21
 SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE
 "app_post"."blog_id" IN (1, 2)
 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
 "app_post"."id" = 1 LIMIT 21
 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
 "app_post"."id" = 2 LIMIT 21
 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
 "app_post"."id" = 3 LIMIT 21
 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
 "app_post"."id" = 4 LIMIT 21
 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE
 "app_post"."id" = 5 LIMIT 21
 }}}
 I can understand it might be a good idea to have the related-id's for the
 blog on hand should you need them later. But I also think, that by using
 `.only()` you are explicitly telling django - I don't need these. This is
 a real problem for larger data-sets, where you end up with thousands of
 extra round-trips to the database.

 === Context ===

 This is an issue that came up in the wild. I'm using a third-party
 optimiser that improves the performance of graphQL queries by decorating
 querysets with `only()`, `select_related()` etc. It correctly identifies
 that I am only using certain fields and applies `only()` to them, knowing
 I will never need to access certain related fields. This unfortunately
 results in django producing many additional hits to the database. I don't
 think this is expected behaviour from django.

--
-- 
Ticket URL: <https://code.djangoproject.com/ticket/36157#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/django-updates/01070194b42d8739-dc12ab21-de63-4aca-877d-b0eabe1420aa-000000%40eu-central-1.amazonses.com.

Reply via email to