For those who weren't at DjangoCon, here's the state of play with
regards to multi-db support: Django actually supports multiple
database connections right now: the Query class (in django/db/models/
sql/query.py) accepts a connection argument to its constructor, and
the QuerySet class (django/db/models/query.py) can be passed an
optional Query instance - if you don't pass it one, it will create a
Query that uses the default django.db.connection.

As a result, if you want to talk to a different connection you can do
it right now using a custom manager:

from django.db.models import Manager, sql
from django.db.models.query import QuerySet

class MyManager(Manager):

    def get_query_set(self):
        # my_custom_db_connection is a DB-API connection you have
        # opened yourself, rather than django.db.connection
        query = sql.Query(self.model, my_custom_db_connection)
        return QuerySet(self.model, query)

As Malcolm described it, he's provided the plumbing, now we need to
provide the porcelain in the form of a powerful, user-friendly API to
this stuff. Here's my first attempt at an API design.

Requirements
============

There are a number of important use-cases for multi-db support:

* Simple master-slave replication: SELECT queries are distributed
  between slaves, while INSERT, UPDATE and DELETE statements are
  sent to the master.
* Different Django applications live on different databases, e.g.
  a forum on one database while the blog lives on another.
* Moving data between different databases - it would be useful if
  you could do this using the ORM to help paper over the
  differences in SQL syntax.
* Sharding: data in a single Django model is distributed across
  multiple databases depending on some kind of heuristic (e.g. by
  user ID, or with archived content moved to a different server).
* Replication tricks, for example if you use MySQL's InnoDB for
  your data but replicate to a MyISAM server somewhere so you can
  use MySQL full-text indexing to search (Flickr used to do this).

I've probably missed some; please feel free to fill in the gaps.

We don't need to solve every problem, but we do need to provide
obvious hooks for how those problems should be solved. Sharding, for
example, is extremely application specific. I don't think Django
should automatically shard your data for you if you specify 'sharding
= True' on a model class, but it should provide documented hooks for
deciding which database connection a given query should use, so that
sharding can be implemented without too much pain.

Different applications on different databases, on the other hand, is
something Django should support out of the box. Likewise, master-slave
replication is common enough that it would be good to solve it with as
few lines of user-written code as possible (it's the first step most
people take to scale their database after adding caching - and it's a
sweet spot for the kind of read-heavy content sites that Django is
particularly suited for).

Proposed API
============

Here's my first attempt at describing a user-facing API.

First, we need a way of specifying multiple database connections.
Adrian has already expressed an interest in moving to DSNs rather than
individual settings, so I suggest something like this:

DATABASES = {
    'default': 'mysql://foo:secret@localhost/baz',
}

With multiple databases configured this could be:

DATABASES = {
    'master': 'mysql://foo:secret@master.example.com/mydb',
    'slave1': 'mysql://foo:secret@slave1.example.com/mydb',
    'slave2': 'mysql://foo:secret@slave2.example.com/mydb',
    'archive': 'mysql://foo:secret@archive.example.com/mydb',
    'default': 'master',
}

There are two types of connection string - DSNs and aliases. A DSN
contains '://' while an alias does not. Aliases can be used even
within the DATABASES setting itself, as with 'default' in the above
example.

It should be possible to use a DSN that has not been defined in the
DATABASES setting. As a result, I propose that anywhere in Django that
accepts a connection alias should also accept a DSN or even a raw DB-
API compliant connection object.
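
To make the interchangeability concrete, here's a rough sketch of how
such a value might be resolved - resolve_connection() is a made-up
name for illustration, not existing Django code:

from django.conf import settings

def resolve_connection(value):
    # Already an open DB-API connection object? Use it as-is.
    if hasattr(value, 'cursor'):
        return value
    # Otherwise follow aliases until we reach a DSN (contains '://').
    seen = set()
    while '://' not in value:
        if value in seen:
            raise ValueError('Circular connection alias: %r' % value)
        seen.add(value)
        value = settings.DATABASES[value]
    return value  # a DSN the backend layer knows how to open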

The QuerySet.using() method
---------------------------

Next, we need a way of telling Django which connection to use. I
propose a new queryset method as the lowest level way of doing this,
called 'using':

qs = Article.objects.filter(published__lt=...).using('archive')

"using(alias_or_connection_or_dsn)" simply tells the QuerySet to
execute against a different connection by updating the connection
attribute on its internal Query object.
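
As a rough sketch, and reusing the hypothetical resolve_connection()
helper from above, the implementation could look something like this:

class QuerySet:
    ...
    def using(self, alias_or_connection_or_dsn):
        # Clone the queryset and point its Query at another connection.
        clone = self._clone()
        clone.query.connection = resolve_connection(
            alias_or_connection_or_dsn)
        return clone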

Other options for this method name include:

with_db()
with_connection()

I preferred "using()" as it reads nicely and doesn't contain an
underscore.

using() represents the lowest level user-facing API. We can cover a
common case (different applications on different databases) with the
following:

class Article(models.Model):
    ...
    class Meta:
        using = 'articles'

This means "use the articles connection for all queries originating
with this model". I'm repurposing the term 'using' here for API
consistency.

Advanced connection selection
-----------------------------

All of the other use-cases above boil down to one key decision: given
a particular database query, which database connection should I
execute the query against?

I propose adding a manager method which is called every time that
decision is made, and which is designed to be overridden by advanced
users. Here's the default implementation:

class Manager:
    ...
    def get_connection(self, query):
        from django.db import connection
        return connection # Use the default connection for everything

Here's an implementation of very simple master-slave
replication:

class Manager:
    ...
    def get_connection(self, query):
        # InsertQuery, DeleteQuery and UpdateQuery live in
        # django/db/models/sql/subqueries.py
        if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
            return 'master'
        else:
            return 'slave'
            # Or, if we have more than one slave:
            # return random.choice(['slave1', 'slave2'])  # Footnote [1]

The above would be even easier if InsertQuery, DeleteQuery and
UpdateQuery were all subclasses of a ModificationQuery class (they are
currently all direct subclasses of Query) - then the check could
simply be:

    if isinstance(query, ModificationQuery):

We could even ship a MasterSlaveManager that implements a variant of
the above logic in django.contrib.masterslave (more for educational
and marketing purposes than because it's something that's hard to
implement).
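
Purely as a sketch, such a contrib manager might look something like
this (ModificationQuery is the hypothetical base class suggested
above, and the constructor arguments are made up):

import random
from django.db.models import Manager

class MasterSlaveManager(Manager):
    def __init__(self, master='master', slaves=('slave',)):
        super(MasterSlaveManager, self).__init__()
        self.master = master
        self.slaves = list(slaves)

    def get_connection(self, query):
        # Writes go to the master, reads to a randomly chosen slave.
        if isinstance(query, ModificationQuery):
            return self.master
        return random.choice(self.slaves)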

Note that of the example get_connection() methods above, the first
returns an actual connection object while the others return connection
aliases. This makes for a more convenient API, and is consistent with
my earlier suggestion that DSNs, aliases and connection objects should
be interchangeable.

Since the get_connection method has access to the full query object,
even complex sharding schemes based on criteria such as the individual
fields being looked up in the query could be supported reasonably
well.

Dealing with single queries that span multiple databases
--------------------------------------------------------

Once you have different tables living in different databases there's
always the chance that someone will write a query that attempts to
join tables living on two different database servers. I don't
think we should address this problem at all (aside from maybe
attempting to throw a descriptive error message should it happen) - if
you're scaling across different servers you need to be aware of the
limitations of that approach.

That said, databases like MySQL actually do allow cross-database joins
provided both databases live on the same physical server. Is this
something we should support? I'd like to say "no" and assume that
people who need to do that will be happy rolling their own SQL using a
raw cursor, but maybe I'm wrong and it's actually a common use case.

Connection pooling
------------------

This is where I get completely out of my depth, but it seems like we
might need to implement connection pooling at some point since we are
now maintaining multiple connections to multiple databases. We could
roll our own solution here, but to my knowledge SQLAlchemy has a solid
connection pool implementation which is entirely separate from the
rest of the SQLAlchemy ORM. We could just ensure that if someone needs
connection pooling there's a documented way of integrating the
SQLAlchemy connection pool with Django - that way we don't have an
external dependency on SQLAlchemy for the common case, but people who
need connection pools can still have them.
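
For illustration, here's a minimal sketch of wrapping a plain DB-API
module in SQLAlchemy's standalone QueuePool - MySQLdb and the
connection details are just placeholders:

import MySQLdb
from sqlalchemy.pool import QueuePool

def _connect():
    return MySQLdb.connect(host='localhost', user='foo',
                           passwd='secret', db='mydb')

pool = QueuePool(_connect, pool_size=5, max_overflow=10)

conn = pool.connect()   # checked out from the pool, DB-API compatible
try:
    cursor = conn.cursor()
    cursor.execute('SELECT 1')
finally:
    conn.close()        # returns the connection to the pool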

Backwards compatibility
-----------------------

I think we can do all of the above while maintaining almost 100%
backwards compatibility with Django 1.0. In the absence of a DATABASES
setting we
can construct one using Django's current DATABASE_ENGINE /
DATABASE_NAME / etc settings to figure out the 'default' connection.
Everything else should Just Work as it does already - the only people
who will need to worry are those who have hacked together their own
multi-db support based on Django internals.
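
A rough sketch of how that fallback could be constructed, assuming the
scheme://user:password@host/name DSN format from the examples above:

from django.conf import settings

DATABASES = getattr(settings, 'DATABASES', None) or {
    'default': '%s://%s:%s@%s/%s' % (
        settings.DATABASE_ENGINE,
        settings.DATABASE_USER,
        settings.DATABASE_PASSWORD,
        settings.DATABASE_HOST or 'localhost',
        settings.DATABASE_NAME,
    ),
}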

Justification
=============

Why is get_connection() on Manager, not Query or QuerySet?
----------------------------------------------------------

The logic that picks which database connection is used could live in
three potential places: on the manager, on the QuerySet class or on
the Query itself. The manager seems to me like the most natural place
for this to live - users are already used to modifying the manager,
it's trivial to swap in a different manager (e.g. a
MasterSlaveManager) for a given model, and the manager class gets to
see all of the queries that go through it, including things like
Article.objects.create(). If there are good reasons it should go on
the Query or QuerySet instead I'd love to hear them.

Why hand get_connection a Query rather than a QuerySet?
-------------------------------------------------------

Because when you call a model's .save() method, a Query object is all
you get. That said, a QuerySet is a higher-level API, so it might be
easier to make a decision by introspecting that than by introspecting
a Query. We could get the best of both worlds and define
the function signature as "get_connection(query, queryset=None)" -
passing in both arguments if a queryset is available and only the
first argument otherwise. Seems a bit ugly though.

I expect that some get_connection() methods will just call
query.as_sql() and make a decision based on the string that comes
back.
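
For example, a rough sketch of an as_sql()-based decision - the table
name check here is purely illustrative:

class Manager:
    ...
    def get_connection(self, query):
        sql, params = query.as_sql()
        if 'myapp_archivedarticle' in sql:
            return 'archive'
        return 'default'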

As an aside, since the get_connection() method is called for every
query it could actually double up as a hook for adding SQL logging or
profiling.

Another aside: maybe pick_connection(query) is a better method name?

Feedback?
=========

I haven't attempted to implement any of the above yet, but from
reading through the code it seems like it would be possible without a
great deal of alterations to Django. So... does this look feasible? Is
the API powerful enough to do what people need? Is this easy to
understand from an end-user point of view?

Cheers,

Simon

[1] - random.choice(['slave1', 'slave2']) is actually a poor way of
doing this, as we probably want to pick a single slave at the
beginning of each request and use that for all following queries. This
probably means using a threadlocal that is reset at the end of each
request via a signal, which is crufty but unavoidable.
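
For what it's worth, a minimal sketch of that threadlocal approach -
the names are made up and the signal wiring is only illustrative:

import random
import threading
from django.core.signals import request_finished

_state = threading.local()
SLAVES = ['slave1', 'slave2']

def current_slave():
    # Pick one slave per thread/request and stick with it.
    if not hasattr(_state, 'slave'):
        _state.slave = random.choice(SLAVES)
    return _state.slave

def reset_slave(**kwargs):
    # Forget the choice when the request finishes.
    if hasattr(_state, 'slave'):
        del _state.slave

request_finished.connect(reset_slave)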

