Re: Implement QuerySet.contains?

Johan Schiff Wed, 03 Jun 2020 12:46:37 -0700

To answer my own question: No, I wasn't doing it correctly. I should have
done a sanity check before posting.


New timeit code and results at bottom, now using a smaller dataset (~900
objects).

Notable:

   - An .exists() query is about 1/100 the time of full fetch in this case.
   This difference would ofc be much bigger if I did it on the 100 000+
   objects.
   - If already prefetched, it should be worth it to search prefetched data
   if the dataset is <700 objects. It seems rare to prefetch that many
   objects. If so, a developer could easily be explicit and use .exists()
   method.

*Based on this exact scenario: 100 times slower for using obj in
queryset is quite a performance hit. On the other hand, an extra 1 % query
time for a queryset that is later evaluated is marginal.*

If a developer knows they want to evaluate the queryset later, it would be
nice to have a .evaluate() that does self._fetch_all() and returns self. I
think that's preferable over list(queryset)-method, because we don't need
to create an extra list.

# Old explicit fetch method
queryset = Stuff.objects.all()
if obj in list(queryset):
    pass

# My suggestion
queryset = Stuff.objects.all().evaluate()
if obj in queryset:
    pass



Updated timeit function and results:

import timeit

def time_fn(title, stmt, number=10_000):
    setup = [
        'from movies.models import Movie',
        'qs=Movie.objects.all()',
        'obj_first = qs.first()',
        'obj_mid = qs[300]',
        'obj_last = qs.last()',
        'list(qs)',
    ]
    result_time = timeit.timeit(stmt, setup='\n'.join(setup), number=number)
    print(f'{title}: {result_time}')

time_fn('Database fetch', 'qs._result_cache=None\nlist(qs)')
time_fn('Prefetched obj_first in qs', 'obj_first in qs')
time_fn('Prefetched obj_mid in qs', 'obj_mid in qs')
time_fn('Prefetched obj_last in qs', 'obj_last in qs')
time_fn('Database obj_first exists()', 'qs.filter(pk=obj_first.pk
).exists()')
time_fn('Database obj_mid exists()', 'qs.filter(pk=obj_mid.pk).exists()')
time_fn('Database obj_last exists()', 'qs.filter(pk=obj_last.pk).exists()')

# Results

Database fetch: 616.097227364
Prefetched obj_first in qs: 0.030961711003328674
Prefetched obj_mid in qs: 6.6988333979970776
Prefetched obj_last in qs: 24.189914419999695
Database obj_first exists(): 6.468764332996216
Database obj_mid exists(): 6.167532913001196
Database obj_last exists(): 6.0190791100030765


Le mer. 3 juin 2020 à 10:52, Johan Schiff <jo...@radkompaniet.se> a écrit :

> Thanks for great info.
>
> First, I'm leaning towards Aymeric's proposition here. I do recognize that
> there is a lot to consider.
>
> This seems to be important:
>
>    1. Developers must be able to explicitly choose methods to optimize
>    for their environment. (Considering database latency, dataset size,
>    prefetch preferred, etc.)
>    2. Behaviour should preferably be consistent across methods. (len(qs),
>    bool(qs), obj in qs)
>    3. Syntax should be pythonic.
>
> I think what Aymeric is proposing would fulfill those demands.
>
> I have done some timeit-tests on a model with 100 000+ records, based on
> Rogers work, on a single host setup using postgresql. My finding is that a
> separate database query is generally faster than current *obj in qs* 
> behaviour,
> even on a prefetched queryset (unless your dataset is really small). This
> tells me that we should prioritize .exists() query unless explicitly stated
> (i.e. *obj in iter(qs)*), even when we have prefetched data. For len(qs)
> and bool(qs) that should not be the case.
>
> It would be interesting to get timings from other setups, so I'm including
> the code I used. (Also, am I doing this correctly?)
>
> import timeit
>
> def time_fn(title, stmt, number=10000, prefetch=False):
>     setup = [
>         'from tickets.models import Ticket',
>         'qs=Ticket.objects.all()',
>         'obj_first = qs.first()',
>         'obj_mid = qs[50000]',
>         'obj_last = qs.last()',
>     ]
>     if prefetch:
>         setup.append('list(qs)')
>     result_time = timeit.timeit(stmt, setup='\n'.join(setup),
> number=number)
>     print(f'{title}: {result_time}')
>
> time_fn('Database fetch', 'list(qs)')
> time_fn('Prefetched obj_first in qs', 'obj_first in qs', prefetch=True)
> time_fn('Prefetched obj_mid in qs', 'obj_mid in qs', prefetch=True)
> time_fn('Prefetched obj_last in qs', 'obj_last in qs', prefetch=True)
> time_fn('Database obj_first exists()', 'qs.filter(pk=obj_first.pk
> ).exists()')
> time_fn('Database obj_mid exists()', 'qs.filter(pk=obj_mid.pk).exists()')
> time_fn('Database obj_last exists()', 'qs.filter(pk=obj_last.pk
> ).exists()')
>
>
> My results:
>
> Database fetch: 25.667968138001015
> Prefetched obj_first in qs: 0.027538340998944477
> Prefetched obj_mid in qs: 1051.1691511649988
> Prefetched obj_last in qs: 2369.5217889660016
> Database obj_first exists(): 6.442390248001175
> Database obj_mid exists(): 6.213641551999899
> Database obj_last exists(): 5.831252460000542
>
>
> Le mar. 2 juin 2020 à 23:15, Roger Gammans <rgamm...@gammascience.co.uk>
> a écrit :
>
>> On Tue, 2020-06-02 at 11:31 -0700, Javier Buzzi wrote:
>>
>>
>>
>> ps @roger
>>
>> >>> timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import
>> myapp.models as m;x = m.GL.objects.all()[324]', number=100)
>> 0.05818330496549606
>>
>> is not doing anything, add a `.exists()` or `len(..)` or something to 
>> evaluate the queryset. That number is WAY too low.
>>
>>
>> Yes, of course. That makes me feel silly, on both counts - the
>> evaluations and the number of iterations. I made number of iterations
>> low;because I started with a test which a single iteration was taking
>> multiple seconds and it that case the default number was way to high for
>> the quick test I was offering. More critically the x in qs; depends on
>> where in the qs the object is and it doesn't matter how big if you only
>> look at the beginning.
>>
>> I reran it; but which shows different results.. See the following re run.
>> Construction here is more complicated to show avg microsecs; per iteration,
>> note the vary iteration lengths, actual runtime was aimed to be between
>> 20secs and a minute or so.
>>
>> >>> # Searching for first in queryset
>>
>> >>> n=10_000_000 ;timeit.timeit('x in qs', setup='import myapp.models as 
>> >>> m;qs = m.GL.objects.all(); qs2=qs;i=len(qs2);x=qs[0]', 
>> >>> number=n)/(n/1_000_000)
>>
>> 0.4136358265765011
>>
>> >>> n = 100_000 ; timeit.timeit('m.GL.objects.filter(pk=x.pk)', 
>> >>> setup='import myapp.models as m;x = list(m.GL.objects.all())[0]',  
>> >>> number=n)/(n/1_000_000)
>>
>> 124.99445254914463
>>
>> >>> # Searching for last in queryset.
>>
>> >>> n=1000; timeit.timeit('x in qs', setup='import myapp.models as m;qs = 
>> >>> m.GL.objects.all(); qs2=qs;i=len(qs2);x=qs[i-1]', number=n)/(n/1_000_000)
>>
>> 50741.098549216986
>>
>> >>> n = 100_000 ; timeit.timeit('m.GL.objects.filter(pk=x.pk)', 
>> >>> setup='import myapp.models as m;x = list(m.GL.objects.all())[-1]',  
>> >>> number=n)/(n/1_000_000)
>>
>> 118.99649207945913
>>
>>
>>
>>
>> --
>>
>> Roger Gammans <rgamm...@gammascience.co.uk>
>>
>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "Django developers (Contributions to Django itself)" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/django-developers/NZaMq9BALrs/unsubscribe
>> .
>> To unsubscribe from this group and all its topics, send an email to
>> django-developers+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/django-developers/3928039038bac9b52279294f7efcac318dc80388.camel%40gammascience.co.uk
>> <https://groups.google.com/d/msgid/django-developers/3928039038bac9b52279294f7efcac318dc80388.camel%40gammascience.co.uk?utm_medium=email&utm_source=footer>
>> .
>>
>
>
> --
> Vänligen, Johan Schiff
> Radkompaniet AB
> 072-229 61 19
>


-- 
Vänligen, Johan Schiff
Radkompaniet AB
072-229 61 19

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/CAD69rUDECkftJxchNM3pxMnLw4gdzq3BWg0JLx5bt4gZRwg0rQ%40mail.gmail.com.

Re: Implement QuerySet.__contains__?

Reply via email to

Re: Implement QuerySet.contains?