#26530: Batch operations on large querysets
-------------------------------------+-------------------------------------
Reporter: mjtamlyn | Owner: nobody
Type: New feature | Status: new
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by mjtamlyn):
There are a few ways of approaching it:
Roughly copying what a paginator does:
{{{
# qs needs a stable ordering, otherwise slices may overlap or skip rows
count = qs.count()
pointer = 0
while pointer < count:
    for obj in qs[pointer:pointer + batch_size]:
        do_something(obj)
    pointer += batch_size
}}}
Walking a sequential id (the same pattern also applies to time series):
{{{
pointer = 0
while True:
    # work from oldest first so objects created during the run still get
    # processed
    batch = qs.filter(id__gt=pointer).order_by('id')[:batch_size]
    if not batch:
        break
    for obj in batch:
        pointer = obj.id
        do_something(obj)
}}}
Ideally the operation should also apply to a values or values_list
queryset. This is a similar piece of code, which doesn't have to worry
about memory as much:
{{{
user_ids = qs.values_list('id', flat=True)
while user_ids:
    batch, user_ids = user_ids[:100], user_ids[100:]
    queue_task(batch)
}}}
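As a rough illustration of the same idea without repeated slicing, here is a minimal sketch of a generic chunking helper. The name `chunked` is made up for this example and is not part of Django; `queue_task` in the comment is the placeholder from the snippet above.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable.

    Hypothetical helper for illustration only. Only one chunk is held in
    memory at a time, so the source can be a lazy iterator.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Possible use with a queryset (assumes `qs` and `queue_task` exist):
#     ids = qs.values_list('id', flat=True).iterator()
#     for batch in chunked(ids, 100):
#         queue_task(batch)
```

Because the helper only pulls from an iterator, it behaves the same for a plain list of ids and for a queryset iterator.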
-----
My motivation for this patch is twofold. Partly I'm bored of writing
similar code when dealing with large querysets, but I have also seen many
developers debugging issues with their code because they hadn't realised
that holding querysets of 10k+ objects in memory is problematic. Having an
easy, documented API, with warnings about why you need it, should help
people become aware of the issues and make them easy to fix.
A better API suggestion could be `for batch in qs.batch(size=100)`. With
that, fixing your broken code could quite possibly be a one-line change.
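
To make the suggestion concrete, here is a minimal sketch of what such a helper might do, written as a standalone function rather than a queryset method. Nothing here is an existing Django API: the function name `batch` and its `size` argument are assumptions taken from the suggestion above. It also assumes the queryset yields model instances with a sortable, unique `pk`, and that it has not already been sliced.

```python
def batch(qs, size=100):
    """Yield lists of at most `size` objects, walking the queryset by pk.

    Hypothetical sketch only. It relies solely on filter(), order_by()
    and slicing, so each batch is a separate LIMIT query and at most
    `size` objects are loaded at once.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    ordered = qs.order_by("pk")
    last_pk = None
    while True:
        # First pass has no lower bound; later passes resume after last_pk.
        page = ordered if last_pk is None else ordered.filter(pk__gt=last_pk)
        objs = list(page[:size])
        if not objs:
            return
        yield objs
        last_pk = objs[-1].pk
```

Walking by primary key rather than by offset means rows inserted with higher keys during the run are still picked up, at the cost of overriding whatever ordering the queryset had.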
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:2>