#28455: Create "inplace" QuerySets to speed up certain operations
-------------------------------------+-------------------------------------
Reporter: Anssi Kääriäinen           | Owner: Keryn Knight
Type: Cleanup/optimization           | Status: assigned
Component: Database layer            | Version: dev
  (models, ORM)                      |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by Keryn Knight):
PR is https://github.com/django/django/pull/14675
Based heavily on Anssi's original implementation idea. This introduces two
ways of opting out of cloning, both ostensibly private API. The first and
primary one is a context manager which temporarily makes the current
QuerySet (and, implicitly, its Query) mutable:
{{{
with User.objects.all()._avoid_cloning() as queryset1:
queryset1.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])
# OR
queryset2 = User.objects.all()
with queryset2._avoid_cloning():
queryset2.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])
}}}
The second is an imperative form, for use cases where multiple filters are
spread across a long run of lines and it isn't desirable to muddy the VCS
history and/or the indent depth with the context manager:
{{{
queryset = User.objects.all()._disable_cloning()
queryset.filter(x=1).exclude(y=2).filter(z__in=[1,2,3])
queryset._enable_cloning()
queryset2 = queryset.filter(abc='test')  # This should be a new clone, as usual.
}}}
There's currently no support for any sort of nesting or ensured balancing
(as with the original implementation, as I understand it):
`._disable_cloning()._disable_cloning()._disable_cloning()` is undone by a
single `._enable_cloning()` call, because it's a toggle rather than a stack
of pushes/pops; hence the preference for the context manager.
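To make the toggle-versus-stack distinction concrete, here is a minimal
illustrative sketch. The classes and the `_disable_cloning()` /
`_enable_cloning()` bodies below are stand-ins modelled on the names in the
PR, not code imported from Django:

```python
class ToggleQuerySet:
    """Cloning control as a boolean toggle, as in the current PR."""
    def __init__(self):
        self._cloning = True

    def _disable_cloning(self):
        self._cloning = False  # repeated calls are idempotent, not counted
        return self

    def _enable_cloning(self):
        self._cloning = True  # a single call undoes any number of disables
        return self


class StackQuerySet:
    """The alternative: a depth counter, requiring balanced enable/disable."""
    def __init__(self):
        self._disable_depth = 0

    @property
    def _cloning(self):
        return self._disable_depth == 0

    def _disable_cloning(self):
        self._disable_depth += 1  # push
        return self

    def _enable_cloning(self):
        self._disable_depth -= 1  # pop
        return self


toggle = ToggleQuerySet()._disable_cloning()._disable_cloning()._disable_cloning()
toggle._enable_cloning()
print(toggle._cloning)  # True: one enable undid three disables

stack = StackQuerySet()._disable_cloning()._disable_cloning()._disable_cloning()
stack._enable_cloning()
print(stack._cloning)  # False: two pops still outstanding
```

The context-manager form sidesteps the balancing problem entirely, since
exiting the `with` block restores the previous state exactly once.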
Given the following SQLite data:
{{{
In [1]: from django.contrib.auth.models import User, Group, Permission
In [2]: Group.objects.count()
Out[2]: 101
In [3]: User.objects.count()
Out[3]: 101
In [4]: Permission.objects.count()
Out[4]: 24
In [5]: User.user_permissions.through.count()
Out[5]: 0
In [6]: for user in User.objects.all():
...: print(f'{user.groups.count()}, {user.groups.first().pk},
{user.user_permissions.count()}')
1, 1, 0
1, 2, 0
1, 3, 0
1, 4, 0
1, 5, 0
... they all have 1 group and zero permissions
In [7]: from django.db import connection
In [8]: connection.queries[-2:]
[{'sql': 'SELECT "auth_user"."id", ... FROM "auth_user"',
  'time': '0.001'},
 {'sql': 'SELECT ("auth_user_groups"."user_id") AS "_prefetch_related_val_user_id", "auth_group"."id", "auth_group"."name" FROM "auth_group" INNER JOIN "auth_user_groups" ON ("auth_group"."id" = "auth_user_groups"."group_id") WHERE "auth_user_groups"."user_id" IN (1, 2, 3, [...] 99, 100, 101)',
  'time': '0.000'}]
}}}
Before applying any of the changes:
{{{
In [2]: %timeit tuple(User.objects.all())
1.26 ms ± 9.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(0 calls to ._clone)
In [3]: %timeit tuple(User.objects.prefetch_related('groups'))
6.52 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(105 calls to ._clone)
In [4]: %timeit tuple(User.objects.prefetch_related('groups', 'user_permissions'))
12.1 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(213 calls to ._clone)
}}}
After enabling the clone-avoidance technique for prefetch_related only:
{{{
In [2]: %timeit tuple(User.objects.all())
1.28 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(0 calls to ._clone)
In [3]: %timeit tuple(User.objects.prefetch_related('groups'))
5.93 ms ± 53.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(4 calls to ._clone)
In [4]: %timeit tuple(User.objects.prefetch_related('groups', 'user_permissions'))
10.2 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(7 calls to ._clone)
}}}
So for the single prefetch it's a decrease of roughly 9%, and for the two
prefetches it's a decrease of roughly 15%. I make no pretense of being able
to convert between improvement factors and percentages, and risk getting it
wrong.
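For what it's worth, working through the arithmetic from the mean timings
above (ignoring the variance) supports those rough figures:

```python
def pct_decrease(before_ms, after_ms):
    """Percentage reduction in mean runtime."""
    return (before_ms - after_ms) / before_ms * 100

# One prefetch ('groups'): 6.52 ms -> 5.93 ms
one_prefetch = pct_decrease(6.52, 5.93)
# Two prefetches ('groups' + 'user_permissions'): 12.1 ms -> 10.2 ms
two_prefetch = pct_decrease(12.1, 10.2)

print(f'{one_prefetch:.1f}%')  # 9.0%
print(f'{two_prefetch:.1f}%')  # 15.7%
```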
The number of calls to `_clone()` was counted by re-running the queries in
fresh `ipython` sessions, with `_clone()` incrementing a global integer each
time it was called. That instrumentation is not part of the actual timeit
numbers.
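A sketch of that counting instrumentation, shown here against a dummy class
rather than Django's QuerySet (the real measurement patched
`QuerySet._clone` in the same fashion):

```python
import functools

CLONE_CALLS = 0  # the global integer incremented on every clone


class FakeQuerySet:
    """Stand-in for django.db.models.query.QuerySet."""
    def _clone(self):
        return FakeQuerySet()


def count_calls(method):
    """Wrap a method so each invocation bumps the global counter."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        global CLONE_CALLS
        CLONE_CALLS += 1
        return method(self, *args, **kwargs)
    return wrapper


# Monkeypatch the method in place; timings are taken separately so the
# counter's overhead never pollutes the %timeit numbers.
FakeQuerySet._clone = count_calls(FakeQuerySet._clone)

qs = FakeQuerySet()
qs._clone()._clone()._clone()
print(CLONE_CALLS)  # 3
```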
These improved prefetch timings only apply when hitting the
`manager.get_queryset()` branch of `prefetch_one_level`, because we can
assume that returns a fresh QuerySet instance (or subclass) that holds OK
data and doesn't need cloning.
The same cannot be said, I don't think, of the `if leaf and lookup.queryset
[...]` branch, because that queryset is shared and in an unknown state, so
at least for the initial MVP it remains cloned. It ''may'' do one fewer
clone operation ''if'' the database would be changed by `using()`.
Calling `_chain()` directly, without disabling cloning:
{{{
In [1]: from django.contrib.auth.models import User, Group, Permission
In [2]: x = User.objects.all()
In [4]: %timeit y = x._chain()
7.05 µs ± 62.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
}}}
With cloning disabled:
{{{
In [6]: %timeit y = x._chain()
275 ns ± 3.15 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
}}}
As I write this, I have not implemented the other updates Anssi made, such
as updating `get_contenttypes_and_models` or `create_permissions` ''et
al'', so that the proposed API changes can be decided on and looked over
before moving forward with other potential usages.
Also as I write this, I've added no new tests to properly demonstrate its
correctness, for the same reason. To kick off the discussion, it's enough
to see whether the test suite passes or fails, and iterate from there.
The number of `QuerySet._clone` operations done by the test suite (Ran
14876 tests, skipped=1192, expected failures=4) drops from `144851` to
`142305`, which doesn't inspire a huge amount of confidence, but it's a
start.
--
Ticket URL: <https://code.djangoproject.com/ticket/28455#comment:4>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.