For those who weren't at DjangoCon, here's the state of play with regard to multi-db support: Django actually supports multiple database connections right now. The Query class (in django/db/models/sql/query.py) accepts a connection argument to its constructor, and the QuerySet class (django/db/models/query.py) can be passed an optional Query instance - if you don't pass it one, it will create a Query that uses the default django.db.connection.
As a result, if you want to talk to a different connection you can do it right now using a custom manager:

    from django.db.models import sql
    from django.db.models.manager import Manager
    from django.db.models.query import QuerySet

    class MyManager(Manager):
        def get_query_set(self):
            # my_custom_db_connection is a second connection you have set up yourself
            query = sql.Query(self.model, my_custom_db_connection)
            return QuerySet(self.model, query)

As Malcolm described it, he's provided the plumbing; now we need to provide the porcelain in the form of a powerful, user-friendly API to this stuff. Here's my first attempt at an API design.

Requirements
============

There are a number of important use-cases for multi-db support:

* Simple master-slave replication: SELECT queries are distributed between slaves, while UPDATE and DELETE statements are sent to the master.

* Different Django applications live on different databases, e.g. a forum on one database while the blog lives on another.

* Moving data between different databases - it would be useful if you could do this using the ORM to help paper over the differences in SQL syntax.

* Sharding: data in a single Django model is distributed across multiple databases depending on some kind of heuristic (e.g. by user ID, or with archived content moved to a different server).

* Replication tricks, for example if you use MySQL's InnoDB for your data but replicate to a MyISAM server somewhere so you can use MySQL's full-text indexing to search (Flickr used to do this).

I've probably missed some; please feel free to fill in the gaps.

We don't need to solve every problem, but we do need to provide obvious hooks for how those problems should be solved. Sharding, for example, is extremely application specific. I don't think Django should automatically shard your data for you if you specify 'sharding = True' on a model class, but it should provide documented hooks for making a custom decision about which database connection should be used for a given query, so that sharding can be implemented without too much pain.

Different applications on different databases, on the other hand, is something Django should support out of the box. Likewise, master-slave replication is common enough that it would be good to solve it with as few lines of user-written code as possible (it's the first step most people take to scale their database after adding caching - and it's a sweet spot for the kind of read-heavy content sites that Django is particularly suited for).

Proposed API
============

Here's my first attempt at describing a user-facing API.

First, we need a way of specifying multiple database connections. Adrian has already expressed an interest in moving to DSNs rather than individual settings, so I suggest something like this:

    DATABASES = {
        'default': 'mysql://foo:[EMAIL PROTECTED]/baz',
    }

With multiple databases configured this could be:

    DATABASES = {
        'master': 'mysql://foo:[EMAIL PROTECTED]/mydb',
        'slave1': 'mysql://foo:[EMAIL PROTECTED]/mydb',
        'slave2': 'mysql://foo:[EMAIL PROTECTED]/mydb',
        'archive': 'mysql://foo:[EMAIL PROTECTED]/mydb',
        'default': 'master',
    }

There are two types of connection string - DSNs and aliases. A DSN contains '://' while an alias does not. Aliases can be used even within the DATABASES setting itself, as with 'default' in the above example.

It should be possible to use a DSN that has not been defined in the DATABASES setting. As a result, I propose that anywhere in Django that accepts a connection alias should also accept a DSN or even a raw DB-API compliant connection object.
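To make that interchangeability concrete, here's a rough sketch of how an internal helper might resolve whatever value it is handed. The helper names, the toy DSN parser and the example values below are purely illustrative assumptions, not part of the proposal:

    # Hypothetical sketch: turn an alias, a DSN string or a raw
    # connection object into something usable.
    EXAMPLE_DATABASES = {
        'master': 'mysql://user:secret@dbhost/mydb',
        'default': 'master',
    }

    def parse_dsn(dsn):
        # Very naive parsing of scheme://user:password@host/name
        scheme, rest = dsn.split('://', 1)
        credentials, location = rest.split('@', 1)
        user, password = credentials.split(':', 1)
        host, name = location.split('/', 1)
        return dict(scheme=scheme, user=user, password=password,
                    host=host, name=name)

    def resolve_connection(value):
        if not isinstance(value, basestring):
            return value                          # already a connection object
        while '://' not in value:                 # an alias, possibly pointing
            value = EXAMPLE_DATABASES[value]      # at another alias ('default')
        # A real implementation would hand the parsed DSN to the right
        # database backend; here we just return the parsed pieces.
        return parse_dsn(value)

With the example values above, resolve_connection('default') follows 'default' to 'master' and then parses the DSN, while a raw connection object passes straight through.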
The QuerySet.using() method
---------------------------

Next, we need a way of telling Django which connection to use. I propose a new queryset method as the lowest level way of doing this, called 'using':

    qs = Article.objects.filter(published__lt=...).using('archive')

using(alias_or_connection_or_dsn) simply tells the QuerySet to execute against a different connection, by updating its internal .connection attribute. Other options for this method name include:

    with_db()
    with_connection()

I preferred using() as it reads nicely and doesn't contain an underscore.

using() represents the lowest level user-facing API. We can cover a common case (different applications on different databases) with the following:

    class Article(models.Model):
        ...
        class Meta:
            using = 'articles'

This means "use the articles connection for all queries originating with this model". I'm repurposing the term 'using' here for API consistency.

Advanced connection selection
-----------------------------

All of the other use-cases above boil down to one key decision: given a particular database query, which database connection should I execute the query against?

I propose adding a manager method which is called every time that decision is made, and which is designed to be over-ridden by advanced users. Here's the default implementation:

    class Manager:
        ...
        def get_connection(self, query):
            from django.db import connection
            return connection  # Use the default connection for everything

Here's an implementation which provides very simple master-slave replication:

    class Manager:
        ...
        def get_connection(self, query):
            if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
                return 'master'
            else:
                return 'slave'
                # Or, if we have more than one slave:
                # return random.choice(['slave1', 'slave2'])  # Footnote [1]

The above would be even easier if InsertQuery, DeleteQuery and UpdateQuery were all subclasses of a ModificationQuery class (they are currently all direct subclasses of Query) - then the check could simply be:

    if isinstance(query, ModificationQuery)

We could even ship a MasterSlaveManager that implements a variant of the above logic in django.contrib.masterslave (more for educational and marketing purposes than because it's something that's hard to implement).

Note that of the two example get_connection() methods above, one returns an actual connection object while the other returns a connection alias. This makes for a more convenient API, and is consistent with my suggestion above that DSNs, aliases and connection objects should be interchangeable.

Since the get_connection() method has access to the full query object, even complex sharding schemes based on criteria such as the individual fields being looked up in the query could be supported reasonably well.
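As a concrete (if simplified) illustration, here's what a sharding manager might look like under the proposed API. The shard aliases and the stub for pulling a user ID out of the query are assumptions made for the sake of the example, since that part is inherently application specific:

    from django.db.models import Manager

    NUM_USER_SHARDS = 4  # hypothetical aliases 'shard0' .. 'shard3' in DATABASES

    class ShardedByUserManager(Manager):
        def get_connection(self, query):
            # get_connection() is the hook proposed above, not existing API.
            user_id = self.extract_user_id(query)
            if user_id is None:
                return 'default'  # fall back to the default alias
            return 'shard%d' % (user_id % NUM_USER_SHARDS)

        def extract_user_id(self, query):
            # Application specific: introspect the query (its where clause,
            # or even the string returned by query.as_sql()) for a user_id
            # constraint. Left as a stub here.
            return None

The same pattern would cover the "archived content moved to a different server" case from the requirements list, just with a different heuristic in place of the user ID lookup.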
Dealing with single queries that span multiple databases
---------------------------------------------------------

Once you have different tables living in different databases there's always the chance that someone will try to write a query that attempts to join tables that live on two different database servers. I don't think we should address this problem at all (aside from maybe attempting to throw a descriptive error message should it happen) - if you're scaling across different servers you need to be aware of the limitations of that approach.

That said, databases like MySQL actually do allow cross-database joins provided both databases live on the same physical server. Is this something we should support? I'd like to say "no" and assume that people who need to do that will be happy rolling their own SQL using a raw cursor, but maybe I'm wrong and it's actually a common use case.

Connection pooling
------------------

This is where I get completely out of my depth, but it seems like we might need to implement connection pooling at some point, since we are now maintaining multiple connections to multiple databases. We could roll our own solution here, but to my knowledge SQLAlchemy has a solid connection pool implementation which is entirely separate from the rest of the SQLAlchemy ORM. We could just ensure that if someone needs connection pooling there's a documented way of integrating the SQLAlchemy connection pool with Django - that way we don't have an external dependency on SQLAlchemy for the common case, but people who need connection pools can still have them.

Backwards compatibility
-----------------------

I think we can do all of the above while maintaining almost 100% backwards compatibility with Django 1.0. In the absence of a DATABASES setting we can construct one using Django's current DATABASE_ENGINE / DATABASE_NAME / etc. settings to figure out the 'default' connection. Everything else should Just Work as it does already - the only people who will need to worry are those who have hacked together their own multi-db support based on Django internals.

Justification
=============

Why is get_connection() on Manager, not Query or QuerySet?
----------------------------------------------------------

The logic that picks which database connection is used could live in three potential places: on the manager, on the QuerySet class or on the Query itself. The manager seems to me like the most natural place for this to live - users are already used to modifying the manager, it's trivial to swap in a different manager (e.g. a MasterSlaveManager) for a given model, and the manager class gets to see all of the queries that go through it, including things like Article.objects.create(). If there are good reasons it should go on the Query or QuerySet instead, I'd love to hear them.

Why hand get_connection a Query rather than a QuerySet?
-------------------------------------------------------

Because when you call a model's .save() method, a Query object is all you get. That said, a QuerySet is a higher level API, and it might hence be easier to make a decision by introspecting that than by introspecting a Query. We could get the best of both worlds and define the function signature as get_connection(query, queryset=None), passing in both arguments if a queryset is available and only the first argument otherwise. Seems a bit ugly though.

I expect that some get_connection() methods will just call query.as_sql() and make a decision based on the string that comes back.

As an aside, since the get_connection() method is called for every query it could actually double up as a hook for adding SQL logging or profiling.

Another aside: maybe pick_connection(query) is a better method name?

Feedback?
=========

I haven't attempted to implement any of the above yet, but from reading through the code it seems like it would be possible without a great deal of alteration to Django.

So... does this look feasible? Is the API powerful enough to do what people need? Is this easy to understand from an end-user point of view?

Cheers,

Simon

[1] - random.choice(['slave1', 'slave2']) is actually a poor way of doing this, as we probably want to pick a single slave at the beginning of each request and use that for all following queries. This probably means using a threadlocal that is reset at the end of each request via a signal, which is crufty but unavoidable.
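For what it's worth, here's roughly what that threadlocal trick could look like, assuming the proposed get_connection() hook and the 'master'/'slave1'/'slave2' aliases from the earlier examples - a sketch only, not a finished implementation:

    import random
    import threading

    from django.core.signals import request_finished
    from django.db.models import Manager, sql

    SLAVES = ['slave1', 'slave2']
    _state = threading.local()

    def _clear_slave(**kwargs):
        # Forget the per-thread choice once the request is done.
        _state.slave = None

    request_finished.connect(_clear_slave)

    class MasterSlaveManager(Manager):
        def get_connection(self, query):
            # get_connection() is the proposed hook, not existing API.
            if isinstance(query, (sql.InsertQuery, sql.UpdateQuery,
                                  sql.DeleteQuery)):
                return 'master'
            if getattr(_state, 'slave', None) is None:
                # Pick one slave and stick with it for the rest of the request.
                _state.slave = random.choice(SLAVES)
            return _state.slave

Resetting the choice on request_finished keeps the slave sticky for the duration of a request without leaking it into the next one, at the cost of exactly the cruft described above.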