On Sep 10, 7:53 pm, Simon Willison <[EMAIL PROTECTED]> wrote: > Dealing with single queries that span multiple databases > -------------------------------------------------------- > > Once you have different tables living in different databases there's > always the chance that someone will try to write a query that attempts > to join tables that live on two different database servers. I don't > think we should address this problem at all (aside from maybe > attempting to throw a descriptive error message should it happen) - if > you're scaling across different servers you need to be aware of the > limitations of that approach. > > That said, databases like MySQL actually do allow cross-database joins > provided both databases live on the same physical server. Is this > something we should support? I'd like to say "no" and assume that > people who need to do that will be happy rolling their own SQL using a > raw cursor, but maybe I'm wrong and it's actually a common use case. > > Connection pooling > ------------------ > > This is where I get completely out of my depth, but it seems like we > might need to implement connection pooling at some point since we are > now maintaining multiple connections to multiple databases. We could > roll our own solution here, but to my knowledge SQLAlchemy has a solid > connection pool implementation which is entirely separate from the > rest of the SQLAlchemy ORM. We could just ensure that if someone needs > connection pooling there's a documented way of integrating the > SQLAlchemy connection pool with Django - that way we don't have an > external dependency on SQL Alchemy for the common case but people who > need connection pools can still have them. > > Backwards compatibility > ----------------------- > > I think we can do all of the above while maintaining almost 100% > backwards with Django 1.0. In the absence of a DATABASES setting we > can construct one using Django's current DATABASE_ENGINE / > DATABASE_NAME / etc settings to figure out the 'default' connection. > Everything else should Just Work as it does already - the only people > who will need to worry are those who have hacked together their own > multi-db support based on Django internals. >
I think sharding really is something every developer would do different because it's just so dependent on the actual business logic and how your models work. That said maybe there's a few things lots of people agree on and could build a base for more detailed implementations. At our (large-scale) project we chose Django because of the ORM and I implemented a complete sharding setup on top of the ORM. The important thing for me was to have quite a bit of magic when it comes to handling the DB. Because you specify relations between data and even specify how you use them (by building querysets) there's a lot of information to draw conclusions from. This can lead to a simple and plain interface. My goal was to have the same application code run on a single-DB machine and a sharded DB environment without changing anything. The problem with sharding is, that lots of operations won't work. No JOINs, no DB-side foreign keys, no transactions, no auto_increment IDs... Another few won't work the way they are supposed to. ORDER BY, LIMIT, COUNT and so on. But like some of you said that's just the way it is. If you want to use sharding you should take care of those exceptions yourself. Our project is based on profiles. And our profile-based sharding should put all data that's related to a profile on one shard. My sharding config now tells the ORM that there's models that are always on the same DB machine (photo sets and photos and comments for the photos for example). Now once you have a photo object and trigger that foreign key (photo.comments_set.all() example [1]) this query will go directly to the shard the photo itself is on. Since we already know that shard there's not even a need to look that up. Sharding exceptions occour when the ORM tries something it can't handle if the model is sharded. The great thing about this is, that we can reduce the amount of use cases where we actually have to spawn a query across multiple shards. This won't work in every environment. I implemented this by hacking quite a bit of Django source (mainly the descriptors of the ForeignKey field). It sure would be neat to have a proper API for this. The real problem with these techniques is not to implement them for youself. But to implement them for everyone else. Since I'm the one responsible for our ORM and models I can hack away just like this. Doing it properly seems like a whole other story. -Jan [1] - A little bit of my model code: This will put all Profile objects on shards 11-13 and establish dependencies so that dependent objects and querysets know a priori which shard to query. class ProfileShardManager(ShardManager): # Tell the model which shards it belongs to. use_shards = [11,12,13] def initial_id_to_shard_mapping(self,id): # This is only the initial mapping. If a new profile is created # we're doing round robin for all shards. Later we could prefer # newly added servers for new profiles. return (11 + id % 3) class Profile(Model): sharding = ProfileShardManager() email = EmailField() name = CharField() class Photoset(Model): sharding = ShardDependency(Profile) profile = ForeignKey(Profile) description = TextField() class Photo(Model): sharding = ShardDependency(Profile,pooled_fields=['photoset']) profile = ForeignKey(Profile) photoset = ForeignKey(Photoset) title = CharField() class PhotoComment(Model): sharding = ShardDependency(Profile,pooled_fields=['photoset']) profile = ForeignKey(Profile) photo = ForeignKey(Photo) text = TextField() Django shell: >>> p = Photo.objects.get(pk=5) # Doing a lookup in the shard index table behind the scenes # to find the correct database for the Photo object with pk=5 >>> p <Photo: #5> >>> p._shard # Every object knows where it lives <Shard 11> >>> p.photoset # Because of pooled_fields p passes along the shard 11 # we know every related model has to be on shard 11 # we only query shard 11 with that ID <Photoset: #1> >>> p.comments_set.all() # Same applies to reverse foreign keys [<PhotoComment: #1>,<PhotoComment: #2>,<PhotoComment: #3>,<PhotoComment: #4>] --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Django developers" group. To post to this group, send email to django-developers@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-developers?hl=en -~----------~----~----~----~------~----~------~--~---