Re: Automatic prefetching in querysets

Luke Plant Wed, 16 Aug 2017 05:48:52 -0700

Hi Josh,


On 16/08/17 02:26, Josh Smeaton wrote:

I believe we should be optimising for the **common** use case, notexpecting everyone to be experts with the ORM.
> If I can come up with a single example where it would significantlydecrease performance (either memory usage or speed) compared to thedefault (and I'm sure I can), then I would be strongly opposed to itever being default behaviour.
The status quo is already one where thousands of users and sites aredoing the non-optimal thing because we're choosing to be conservativeand have users opt-in to the optimal behaviour.

I wouldn't say it is "optimal behaviour". It is behaviour that is anoptimization for some use cases - common ones, I'd agree - but not all.In fact it is not even the best optimization - it performs less well inmany cases than a manual `select_related`.

We don't know how many sites would be affected by the opposite default,because we aren't putting people in that situation. Generatingpotentially large queries (i.e. ones that return lots of results) isgoing to make someone's life a lot harder, and can even crash processes(out of memory errors), or cause massive slowdown due to insufficientmemory and swapping. I have had these kind of issues in productionsystems, and sometimes the answer is to prefetch *less* - which is whythings like `iterator()` exist, because sometimes you *don't* want toload it all into memory upfront.

A massive complaint against Django is how easy it is for users tobuild in 1+N behaviour. Django is supposed to abstract the databaseaway (so much so that we avoid SQL related terms in our querysetmethods), yet one of the more fundamental concepts such as joins weexpect users to know about and optimise for.

This is far from the only place where we expect users to be conscious ofwhat has to go on at the database level. For example, we expect users touse `.count()` and `.exists()` where appropriate, and not use them wherenot appropriate - seehttps://docs.djangoproject.com/en/1.11/topics/db/optimization/#don-t-overuse-count-and-exists

This is an example of doing it right. We could have 'intelligently' made`__len__()` do `count()`, but this introduces queries that the user didnot ask for, and that's hard for developers to predict. That wholesection of the docs on DB optimisation depends on the possibility ofunderstanding things at a DB level, and understanding how QuerySetsbehave, and that they only do the queries that you ask them to do. Themore magic we build in, the harder we make life for people trying tooptimize database access.

If we have an `auto_prefetch` method that has to be called explicitly,then we are allowing users who know less about databases to getsomething that works OK for many situations. But having it on by defaultmakes optimization harder and adds unwelcome surprises.

I'd be more in favour of throwing an error onnon-select-related-foreign-key-access than what we're currently doingwhich is a query for each access.

I have some sympathy with this, but I doubt it is somewhere we can gofrom the current ORM and Django ecosystem as a default - but see below.


The only options users currently have of monitoring poor behaviour is:

1. Add logging to django.db.models
2. Add django-debug-toolbar
3. Investigate page slow downs

Here's a bunch of ways that previously tuned queries can "go bad":

1. A models `__str__` method is updated to include a related field
2. A template uses a previously unused related field
3. A report uses a previously unused related field
4. A ModelAdmin adds a previously unused related field

I completely agree that visibility of this problem is a major issue, andwould really welcome work on improving this, especially in DEBUG mode.One option might be a method that replaces lazy loads with exceptions.This would force you to add the required `prefetch_related` or`select_related` calls. You could have:


.lazy_load(None)   # exception on accessing any non-loaded FK objects

.lazy_load('my_fk1', 'my_fk2') # exceptions on accessing any unloadedFK objects apart from the named ones


.lazy_load('__any__')   # cancel the above, restore default behaviour

This (or something like it) would be a different way to tackle theproblem - just throwing it out. You could have a Meta option to do`.lazy_load(None)` by default for a Model. I've no idea how practicalit would be to wire this all up so that is works correctly, and withnested objects etc.

This would be a bit of a departure from the current ORM, especially ifwe wanted to migrate to it being the default behaviour. If we arethinking long term, we might have to move to something like this if thefuture is async. Lazy-loading ORMs are not very friendly to async code,but an ORM where you must specify a set of queries to execute up-frontmight be more of a possibility.

I think a better question to ask is:
- How many people have had their day/site ruined because we thinkauto-prefetching is too magical?- How many people would have their day/site ruined because we thinkauto-prefetching is the better default?

Answer - we could guess, but we don't know, and we still have problemsboth way, of different frequencies and size. In this case, I think"explicit is better than implicit" is the wisdom to fall back on. Weshould also fall back to the experience we've had with this in the past.We did have more magical performance fixes in the past, such as the`depth` argument to `select_related`, removed in 1.7. You now have toexplicitly name the things you want to be selected, instead of the'easy' option which did the wrong thing in some cases.

If we were introducing a new ORM, I think the above answer would beobvious given what we know of Django use today.
What I'd propose:

1. (optional) A global setting to disable autoprefetching

This would almost surely interact very badly with 3rd party librariesand apps who will all want you to set it differently.

2. An opt out per queryset
3. (optional) An opt out per Meta?
4. Logging any autoprefetches - perhaps as a warning?

Logging of possible improvements or auto-prefetches could be reallyhelpful. We should be careful to namespace the logging so that it can besilenced easily. The biggest issue with logs/warnings is that theunhelpful reports drown out the helpful ones, and then the whole featurebecomes useless or worse.


Regards,

Luke

More experienced Django users that do not want this behaviour aregoing to know about a global setting and can opt in to the oldbehaviour rather easily. Newer users that do not know aboutselect/prefetch_related or these settings will fall into the newbehaviour by default.

It's unreasonable to expect every user of django learn the ins andouts of all queryset methods. I'm probably considered a django ormexpert, and I still sometimes write queries that are non-optimal or*become* non-optimal after changes in unrelated areas. At an absoluteminimum we should be screaming and shouting when this happens. But wecan also fix the issue while complaining, and help guide users intocorrect behaviour.



On Wednesday, 16 August 2017 08:41:31 UTC+10, Anthony King wrote:

    Automatically prefetching is something I feel should be avoided.

    A common gripe I have with ORMs is they hide what's actually
    happening with the database, resulting in
    beginners-going-on-intermediates building libraries/systems that
    don't scale well.

    We have several views in a dashboard, where a relation may be
    accessed once or twice while iterating over a large python
    filtered queryset.
    Prefetching this relation based on the original queryset has the
    potential to add around 5 seconds to the response time (probably
    more, that table has doubled in size since I last measured it).

    I feel it would be better to optimise for your usecase, as apposed
    to try to prevent uncalled-for behaviour.



    On Aug 15, 2017 23:15, "Luke Plant" <l.pla...@cantab.net
    <javascript:>> wrote:

        I agree with Marc here that the proposed optimizations are
        'magical'. I think when it comes to optimizations like these
        you simply cannot know in advance whether doing extra queries
        is going to a be an optimization or a pessimization. If I can
        come up with a single example where it would significantly
        decrease performance (either memory usage or speed) compared
        to the default (and I'm sure I can), then I would be strongly
        opposed to it ever being default behaviour.

        Concerning implementing it as an additional QuerySet method
        like `auto_prefetch()` - I'm not sure what I think, I feel
        like it could get icky (i.e. increase our technical debt), due
        to the way it couples things together. I can't imagine ever
        wanting to use it, though, I would always prefer the manual
        option.

        Luke



        On 15/08/17 21:02, Marc Tamlyn wrote:

        Hi Gordon,

        Thanks for the suggestion.

        I'm not a fan of adding a layer that tries to be this clever.
        How would possible prefetches be identified? What happens
        when an initial loop in a view requires one prefetch, but a
        subsequent loop in a template requires some other prefetch?
        What about nested loops resulting in nested prefetches? Code
        like this is almost guaranteed to break unexpectedly in
        multiple ways. Personally, I would argue that correctly
        setting up and maintaining appropriate prefetches and selects
        is a necessary part of working with an ORM.

        Do you know of any other ORMs which attempt similar magical
        optimisations? How do they go about identifying the cases
        where it is necessary?

        On 15 August 2017 at 10:44, Gordon Wrigley
        <gordon....@gmail.com <javascript:>> wrote:

            I'd like to discuss automatic prefetching in querysets.
            Specifically automatically doing prefetch_related where
            needed without the user having to request it.

            For context consider these three snippets using the
            Question & Choice models from the tutorial
            
<https://docs.djangoproject.com/en/1.11/intro/tutorial02/#creating-models> when
            there are 100 questions each with 5 choices for a total
            of 500 choices.

            Default
            |
            forchoice inChoice.objects.all():
            print(choice.question.question_text,':',choice.choice_text)
            |
            501 db queries, fetches 500 choice rows and 500 question
            rows from the DB

            Prefetch_related
            |
            forchoice inChoice.objects.prefetch_related('question'):
            print(choice.question.question_text,':',choice.choice_text)
            |
            2 db queries, fetches 500 choice rows and 100 question
            rows from the DB

            Select_related
            |
            forchoice inChoice.objects.select_related('question'):
            print(choice.question.question_text,':',choice.choice_text)
            |
            1 db query, fetches 500 choice rows and 500 question rows
            from the DB

            I've included select_related for completeness, I'm not
            going to propose changing anything about it's use. There
            are places where it is the best choice and in those
            places it will still be up to the user to request it. I
            will note that anywhere select_related is optimal
            prefetch_related is still better than the default and
            leave it at that.

            The 'Default' example above is a classic example of the
            N+1 query problem, a problem that is widespread in Django
            apps.
            This pattern of queries is what new users produce because
            they don't know enough about the database and / or ORM to
            do otherwise.
            Experieced users will also often produce this because
            it's not always obvious what fields will and won't be
            used and subsequently what should be prefetched.
            Additionally that list will change over time. A small
            change to a template to display an extra field can result
            in a denial of service on your DB due to a missing prefetch.
            Identifying missing prefetches is fiddly, time consuming
            and error prone. Tools like django-perf-rec
            <https://github.com/YPlan/django-perf-rec> (which I was
            involved in creating) and nplusone
            <https://github.com/jmcarp/nplusone> exist in part to
            flag missing prefetches introduced by changed code.
            Finally libraries like Django Rest Framework and the
            Admin will also produce queries like this because it's
            very difficult for them to know what needs prefetching
            without being explicitly told by an experienced user.

            As hinted at the top I'd like to propose changing Django
            so the default code behaves like the prefetch_related code.
            Longer term I think this should be the default behaviour
            but obviously it needs to be proved first so for now I'd
            suggest a new queryset function that enables this behaviour.

            I have a proof of concept of this mechanism that I've
            used successfully in production. I'm not posting it yet
            because I'd like to focus on desired behavior rather than
            implementation details. But in summary, what it does is
            when accessing a missing field on a model, rather than
            fetching it just for that instance, it runs a
            prefetch_related query to fetch it for all peer instances
            that were fetched in the same queryset. So in the example
            above it prefetches all Questions in one query.

            This might seem like a risky thing to do but I'd argue
            that it really isn't.
            The only time this isn't superior to the default case is
            when you are post filtering the queryset results in Python.
            Even in that case it's only inferior if you started with
            a large number of results, filtered basically all of them
            and the code is structured so that the filtered ones
            aren't garbage collected.
            To cover this rare case the automatic prefetching can
            easily be disabled on a per queryset or per object basis.
            Leaving us with a rare downside that can easily be
            manually resolved in exchange for a significant general
            improvement.

            In practice this thing is almost magical to work with.
            Unless you already have extensive and tightly maintained
            prefetches everywhere you get an immediate boost to
            virtually everything that touches the database, often
            knocking orders of magnitude off page load times.

            If an agreement can be reached on pursuing this then I'm
            happy to put in the work to productize the proof of concept.

--You received this message because you are subscribed to

            the Google Groups "Django developers (Contributions to
            Django itself)" group.
            To unsubscribe from this group and stop receiving emails
            from it, send an email to
            django-develop...@googlegroups.com <javascript:>.
            To post to this group, send email to
            django-d...@googlegroups.com <javascript:>.
            Visit this group at
            https://groups.google.com/group/django-developers
            <https://groups.google.com/group/django-developers>.
            To view this discussion on the web visit
            
https://groups.google.com/d/msgid/django-developers/d402bf30-a5af-4072-8b50-85e921f7f9af%40googlegroups.com
            
<https://groups.google.com/d/msgid/django-developers/d402bf30-a5af-4072-8b50-85e921f7f9af%40googlegroups.com?utm_medium=email&utm_source=footer>.
            For more options, visit
            https://groups.google.com/d/optout
            <https://groups.google.com/d/optout>.

--You received this message because you are subscribed to the

        Google Groups "Django developers (Contributions to Django
        itself)" group.
        To unsubscribe from this group and stop receiving emails from
        it, send an email to django-develop...@googlegroups.com
        <javascript:>.
        To post to this group, send email to
        django-d...@googlegroups.com <javascript:>.
        Visit this group at
        https://groups.google.com/group/django-developers
        <https://groups.google.com/group/django-developers>.
        To view this discussion on the web visit
        
https://groups.google.com/d/msgid/django-developers/CAMwjO1Gaha-K7KkefJkiS3LRdXvaPPwBeuKmhQv6bJFx3dty3w%40mail.gmail.com
        
<https://groups.google.com/d/msgid/django-developers/CAMwjO1Gaha-K7KkefJkiS3LRdXvaPPwBeuKmhQv6bJFx3dty3w%40mail.gmail.com?utm_medium=email&utm_source=footer>.
        For more options, visit https://groups.google.com/d/optout
        <https://groups.google.com/d/optout>.

--You received this message because you are subscribed to the

        Google Groups "Django developers (Contributions to Django
        itself)" group.
        To unsubscribe from this group and stop receiving emails from
        it, send an email to django-develop...@googlegroups.com
        <javascript:>.
        To post to this group, send email to
        django-d...@googlegroups.com <javascript:>.
        Visit this group at
        https://groups.google.com/group/django-developers
        <https://groups.google.com/group/django-developers>.
        To view this discussion on the web visit
        
https://groups.google.com/d/msgid/django-developers/a5780df6-ce60-05ae-88e3-997e6bc88f5c%40cantab.net
        
<https://groups.google.com/d/msgid/django-developers/a5780df6-ce60-05ae-88e3-997e6bc88f5c%40cantab.net?utm_medium=email&utm_source=footer>.
        For more options, visit https://groups.google.com/d/optout
        <https://groups.google.com/d/optout>.

--

You received this message because you are subscribed to the GoogleGroups "Django developers (Contributions to Django itself)" group.To unsubscribe from this group and stop receiving emails from it, sendan email to django-developers+unsubscr...@googlegroups.com<mailto:django-developers+unsubscr...@googlegroups.com>.To post to this group, send email todjango-developers@googlegroups.com<mailto:django-developers@googlegroups.com>.

Visit this group at https://groups.google.com/group/django-developers.

To view this discussion on the web visithttps://groups.google.com/d/msgid/django-developers/2f0b5932-1a38-4eaf-84aa-13960a303141%40googlegroups.com<https://groups.google.com/d/msgid/django-developers/2f0b5932-1a38-4eaf-84aa-13960a303141%40googlegroups.com?utm_medium=email&utm_source=footer>.

For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/9d359e4d-59d6-ff81-8ca6-db3bcd36ac0c%40cantab.net.
For more options, visit https://groups.google.com/d/optout.

Re: Automatic prefetching in querysets

Reply via email to