#26530: Batch operations on large querysets
-------------------------------------+-------------------------------------
     Reporter:  mjtamlyn             |                    Owner:  nobody
         Type:  New feature          |                   Status:  new
    Component:  Database layer       |                  Version:  master
  (models, ORM)                      |
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:
                                     |  Unreviewed
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------

Comment (by mjtamlyn):

 There are a few ways of approaching it:

 Roughly copying a paginator (note that offset-based slicing can skip or
 repeat objects if rows are inserted or deleted mid-run):
 {{{
 count = qs.count()
 pointer = 0
 while pointer < count:
     # Each slice issues a fresh LIMIT/OFFSET query, so only one batch
     # is held in memory at a time.
     for obj in qs[pointer:pointer + batch_size]:
         do_something(obj)
     pointer += batch_size
 }}}

 Basing it on a sequential id (the same approach also works for time
 series):
 {{{
 pointer = 0
 while True:
     # Work from oldest first so objects arriving during the run still
     # get processed.
     batch = qs.filter(id__gt=pointer).order_by('id')[:batch_size]
     if not batch:
         break
     for obj in batch:
         pointer = obj.id
         do_something(obj)
 }}}

 The operation should ideally also apply to a values or values_list
 queryset. This is a similar piece of code which doesn't have to worry
 about memory as much, since only the ids are loaded:
 {{{
 # Materialise the ids up front; plain integers are cheap to hold.
 user_ids = list(qs.values_list('id', flat=True))
 while user_ids:
     batch, user_ids = user_ids[:100], user_ids[100:]
     queue_task(batch)
 }}}

 -----

 My motivation for this patch is twofold - partly I'm bored of writing
 similar code when dealing with large querysets, but also I have seen many
 developers debugging issues in their code because they haven't realised
 that loading 10k+ objects into memory is problematic. An easy, documented
 API, with warnings about why you need it, should make people aware of the
 issue and make it easy to fix.

 A better API might be `for batch in qs.batch(size=100)`, which would mean
 that fixing broken code is often just a one-line change.
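
 For illustration, a standalone generator along these lines could back
 such an API (`batch_queryset` and its pk-based keyset strategy are a
 sketch, not anything Django currently ships):
 {{{
 def batch_queryset(qs, size=100):
     """Yield successive lists of at most `size` objects, ordered by pk.

     Keyset pagination (pk__gt) avoids the skip/repeat problems of
     LIMIT/OFFSET and picks up rows created while the run is in progress.
     """
     pointer = None
     while True:
         page = qs.order_by('pk')
         if pointer is not None:
             page = page.filter(pk__gt=pointer)
         chunk = list(page[:size])
         if not chunk:
             return
         pointer = chunk[-1].pk
         yield chunk

 for batch in batch_queryset(qs, size=100):
     for obj in batch:
         do_something(obj)
 }}}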

--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
