Re: Proposal: user-friendly API for multi-database support

Robert Lofthouse Wed, 10 Sep 2008 17:27:31 -0700

Wow, i'm with Malcolm on the "steady on, let's take this a bit at a
time" :) Someone has been working hard, but that's a huge amount to
think about.


Reading through some of the comments: Most of the functionality
described here would be an explicit mapping and choice made by the
developer, it would be pretty hard to use any form of magic. I'm all
for the setting up of bindings in a settings.py, which you later
access and pass through whenever you do a select/update/delete. I'm
also for per-model connections/connection overrides, with an option to
override on a more general level at the (overseeing) application
level.

I'm against the idea of having functions in settings.py, but I don't
see why they can't be discovered based on definitions in the settings
file.

Will have a proper read through tomorrow, as this thread is
interesting.



On Sep 10, 6:53 pm, Simon Willison <[EMAIL PROTECTED]> wrote:
> For those who weren't at DjangoCon, here's the state of play with
> regards to multi-db support: Django actually supports multiple
> database connections right now: the Query class (in django/db/models/
> sql/query.py) accepts a connection argument to its constructor, and
> the QuerySet class (django/db/models/query.py) can be passed an
> optional Query instance - if you don't pass it one, it will create a
> Query that uses the default django.db.connection.
>
> As a result, if you want to talk to a different connection you can do
> it right now using a custom manager:
>
> class MyManager(Manager):
>
>     def get_query_set(self):
>         query = sql.Query(self.model, my_custom_db_connection)
>         return QuerySet(self.model, query)
>
> As Malcolm described it, he's provided the plumbing, now we need to
> provide the porcelain in the form of a powerful, user-friendly API to
> this stuff. Here's my first attempt at an API design.
>
> Requirements
> ============
>
> There are a number of important use-cases for multi-db support:
>
> * Simple master-slave replication: SELECT queries are distributed
>   between slaves, while UPDATE and DELETE statements are sent to
>   the master.
> * Different Django applications live on different databases, e.g.
>   a forum on one database while blog lives on another.
> * Moving data between different databases - it would be useful if
>   you could do this using the ORM to help paper over the
>   differences in SQL syntax.
> * Sharding: data in a single Django model is distributed across
>   multiple databases depending on some kind of heuristic (e.g. by
>   user ID, or with archived content moved to a different server)
> * Replication tricks, for example if you use MySQL's InnoDB for
>   your data but replicate to a MyISAM server somewhere so you can
>   use MySQL full-text indexing to search (Flickr used to do this).
>
> I've probably missed some; please feel free to fill in the gaps.
>
> We don't need to solve every problem, but we do need to provide
> obvious hooks for how those problems should be solved. Sharding, for
> example, is extremely application specific. I don't think Django
> should automatically shard your data for you if you specify 'sharding
> = True' on a model class, but it should provide documented hooks for
> making a custom decision on which database connection should be used
> for a query that allow sharding to be implemented without too much
> pain.
>
> Different applications on different databases on the other hand is
> something Django should support out of the box. Likewise, master-slave
> replication is common enough that it would be good to solve it with as
> few lines of user-written code as possible (it's the first step most
> people take to scale their database after adding caching - and it's a
> sweet spot for the kind of read-heavy content sites that Django is
> particularly suited for).
>
> Proposed API
> ============
>
> Here's my first attempt at describing a user-facing API.
>
> First, we need a way of specifying multiple database connections.
> Adrian has already expressed an interest in moving to DSNs rather than
> individual settings, so I suggest something like this:
>
> DATABASES = {
>     'default': 'mysql://foo:[EMAIL PROTECTED]/baz',
>
> }
>
> With multiple databases configured this could be:
>
> DATABASES = {
>     'master': 'mysql://foo:[EMAIL PROTECTED]/mydb',
>     'slave1': 'mysql://foo:[EMAIL PROTECTED]/mydb',
>     'slave2': 'mysql://foo:[EMAIL PROTECTED]/mydb',
>     'archive': 'mysql://foo:[EMAIL PROTECTED]/mydb',
>     'default': 'master',
>
> }
>
> There are two types of connection string - DSNs and aliases. A DSN
> contains '://' while an alias does not. Aliases can be used even
> within the DATABASES setting itself, as with 'default' in the above
> example.
>
> It should be possible to use a DSN that has not been defined in the
> DATABASES setting. As a result, I propose that anywhere in Django that
> accepts a connection alias should also accept a DSN or even a raw DB-
> API compliant connection object.
>
> The QuerySet.using() method
> ---------------------------
>
> Next, we need a way of telling Django which connection to use. I
> propose a new queryset method as the lowest level way of doing this,
> called 'using':
>
> qs = Article.objects.filter(published__lt = ...).using('archive')
>
> "using(alias_or_connection_or_dsn)" simply tells the QuerySet to
> execute against a different connection, by updating its
> internal .connection attribute.
>
> Other options for this method name include:
>
> with_db()
> with_connection()
>
> I preferred "using()" as it reads nicely and doesn't contain an
> underscore.
>
> using() represents the lowest level user-facing API. We can cover a
> common case (different applications on different databases) with the
> following:
>
> class Article(models.Model):
>     ...
>     class Meta:
>         using = 'articles'
>
> This means "use the articles connection for all queries originating
> with this model". I'm repurposing the term 'using' here for API
> consistency.
>
> Advanced connection selection
> -----------------------------
>
> All of the other above use-cases boil down to one key decision: given
> a particular database query, which database connection should I
> execute the query against?
>
> I propose adding a manager method which is called every time that
> decision is made, and which is designed to be over-ridden by advanced
> users. Here's the default implementation:
>
> class Manager:
>     ...
>     def get_connection(self, query):
>         from django.db import connection
>         return connection # Use the default connection for everything
>
> Here's an implementation which implements very simple master-slave
> replication:
>
> class Manager:
>     ...
>     def get_connection(self, query):
>         if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
>             return 'master'
>         else:
>             return 'slave'
>             # Or if we have more than one slave:
>             return random.choice(['slave1', 'slave2']) # Footnote [1]
>
> The above would be even easier if InsertQuery, DeleteQuery and
> UpdateQuery were all subclasses of a ModificationQuery class (they are
> currently all direct subclasses of Query) - then the check could
> simply be:
>
>     if isinstance(query, ModificationQuery)
>
> We could even ship a MasterSlaveManager that implements a variant of
> the above logic in django.contrib.masterslave (more for educational
> and marketing purposes than because it's something that's hard to
> implement).
>
> Note that in my above example get_connection() methods one returns an
> actual connection object while the other returns a connection alias.
> This makes for a more convenient API, and is consistent with my above
> suggestion that DSNs, aliases and connection objects should be
> interchangeable.
>
> Since the get_connection method has access to the full query object,
> even complex sharding schemes based on criteria such as the individual
> fields being looked up in the query could be supported reasonably
> well.
>
> Dealing with single queries that span multiple databases
> --------------------------------------------------------
>
> Once you have different tables living in different databases there's
> always the chance that someone will try to write a query that attempts
> to join tables that live on two different database servers. I don't
> think we should address this problem at all (aside from maybe
> attempting to throw a descriptive error message should it happen) - if
> you're scaling across different servers you need to be aware of the
> limitations of that approach.
>
> That said, databases like MySQL actually do allow cross-database joins
> provided both databases live on the same physical server. Is this
> something we should support? I'd like to say "no" and assume that
> people who need to do that will be happy rolling their own SQL using a
> raw cursor, but maybe I'm wrong and it's actually a common use case.
>
> Connection pooling
> ------------------
>
> This is where I get completely out of my depth, but it seems like we
> might need to implement connection pooling at some point since we are
> now maintaining multiple connections to multiple databases. We could
> roll our own solution here, but to my knowledge SQLAlchemy has a solid
> connection pool implementation which is entirely separate from the
> rest of the SQLAlchemy ORM. We could just ensure that if someone needs
> connection pooling there's a documented way of integrating the
> SQLAlchemy connection pool with Django - that way we don't have an
> external dependency on SQL Alchemy for the common case but people who
> need connection pools can still have them.
>
> Backwards compatibility
> -----------------------
>
> I think we can do all of the above while maintaining almost 100%
> backwards with Django 1.0. In the absence of a DATABASES setting we
> can construct one using Django's current DATABASE_ENGINE /
> DATABASE_NAME / etc settings to figure out the 'default' connection.
> Everything else should Just Work as it does already - the only people
> who will need to worry are those who have hacked together their own
> multi-db support based on Django internals.
>
> Justification
> =============
>
> Why is get_connection() on Manager, not Query or QuerySet?
> ----------------------------------------------------------
>
> The logic that picks which database connection is used could live in
> three potential places: on the manager, on the QuerySet class or on
> the Query itself. The manager seems to me like the most natural place
> for this to live - users are already used to modifying the manager,
> it's trivial to swap in a different manager (e.g. a
> MasterSlaveManager) for a given model and the manager class gets to
> see all of the queries that go through it, including things like
> Article.objects.create(). If there are good reasons it should go on
> the Query or QuerySet instead I'd love to hear them.
>
> Why hand get_connection a Query rather than a QuerySet?
> -------------------------------------------------------
>
> Because when you call a model's .save() ...
>
> read more »
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Proposal: user-friendly API for multi-database support

Reply via email to