Re: Proposal: user-friendly API for multi-database support

Jan Oberst Thu, 11 Sep 2008 02:13:53 -0700

On Sep 10, 7:53 pm, Simon Willison <[EMAIL PROTECTED]> wrote:
> Dealing with single queries that span multiple databases
> --------------------------------------------------------
>
> Once you have different tables living in different databases there's
> always the chance that someone will try to write a query that attempts
> to join tables that live on two different database servers. I don't
> think we should address this problem at all (aside from maybe
> attempting to throw a descriptive error message should it happen) - if
> you're scaling across different servers you need to be aware of the
> limitations of that approach.
>
> That said, databases like MySQL actually do allow cross-database joins
> provided both databases live on the same physical server. Is this
> something we should support? I'd like to say "no" and assume that
> people who need to do that will be happy rolling their own SQL using a
> raw cursor, but maybe I'm wrong and it's actually a common use case.
>
> Connection pooling
> ------------------
>
> This is where I get completely out of my depth, but it seems like we
> might need to implement connection pooling at some point since we are
> now maintaining multiple connections to multiple databases. We could
> roll our own solution here, but to my knowledge SQLAlchemy has a solid
> connection pool implementation which is entirely separate from the
> rest of the SQLAlchemy ORM. We could just ensure that if someone needs
> connection pooling there's a documented way of integrating the
> SQLAlchemy connection pool with Django - that way we don't have an
> external dependency on SQL Alchemy for the common case but people who
> need connection pools can still have them.
>
> Backwards compatibility
> -----------------------
>
> I think we can do all of the above while maintaining almost 100%
> backwards with Django 1.0. In the absence of a DATABASES setting we
> can construct one using Django's current DATABASE_ENGINE /
> DATABASE_NAME / etc settings to figure out the 'default' connection.
> Everything else should Just Work as it does already - the only people
> who will need to worry are those who have hacked together their own
> multi-db support based on Django internals.
>


I think sharding really is something every developer would do
different because it's just so dependent on the actual business logic
and how your models work. That said maybe there's a few things lots of
people agree on and could build a base for more detailed
implementations.

At our (large-scale) project we chose Django because of the ORM and I
implemented a complete sharding setup on top of the ORM. The important
thing for me was to have quite a bit of magic when it comes to
handling the DB. Because you specify relations between data and even
specify how you use them (by building querysets) there's a lot of
information to draw conclusions from. This can lead to a simple and
plain interface. My goal was to have the same application code run on
a single-DB machine and a sharded DB environment without changing
anything.

The problem with sharding is, that lots of operations won't work. No
JOINs, no DB-side foreign keys, no transactions, no auto_increment
IDs... Another few won't work the way they are supposed to. ORDER BY,
LIMIT, COUNT and so on. But like some of you said that's just the way
it is. If you want to use sharding you should take care of those
exceptions yourself.

Our project is based on profiles. And our profile-based sharding
should put all data that's related to a profile on one shard. My
sharding config now tells the ORM that there's models that are always
on the same DB machine (photo sets and photos and comments for the
photos for example). Now once you have a photo object and trigger that
foreign key (photo.comments_set.all() example [1]) this query will go
directly to the shard the photo itself is on. Since we already know
that shard there's not even a need to look that up.

Sharding exceptions occour when the ORM tries something it can't
handle if the model is sharded. The great thing about this is, that we
can reduce the amount of use cases where we actually have to spawn a
query across multiple shards.

This won't work in every environment. I implemented this by hacking
quite a bit of Django source (mainly the descriptors of the ForeignKey
field). It sure would be neat to have a proper API for this. The real
problem with these techniques is not to implement them for youself.
But to implement them for everyone else. Since I'm the one responsible
for our ORM and models I can hack away just like this. Doing it
properly seems like a whole other story.

-Jan


[1] - A little bit of my model code:
This will put all Profile objects on shards 11-13 and establish
dependencies so that dependent objects and querysets know a priori
which shard to query.

class ProfileShardManager(ShardManager):
    # Tell the model which shards it belongs to.
    use_shards = [11,12,13]

    def initial_id_to_shard_mapping(self,id):
        # This is only the initial mapping. If a new profile is
created
        # we're doing round robin for all shards. Later we could
prefer
        # newly added servers for new profiles.
        return (11 + id % 3)

class Profile(Model):
    sharding = ProfileShardManager()

    email = EmailField()
    name = CharField()

class Photoset(Model):
    sharding        = ShardDependency(Profile)

    profile = ForeignKey(Profile)
    description = TextField()

class Photo(Model):
    sharding = ShardDependency(Profile,pooled_fields=['photoset'])

    profile = ForeignKey(Profile)
    photoset = ForeignKey(Photoset)
    title = CharField()

class PhotoComment(Model):
    sharding = ShardDependency(Profile,pooled_fields=['photoset'])

    profile = ForeignKey(Profile)
    photo = ForeignKey(Photo)
    text = TextField()

Django shell:
    >>> p = Photo.objects.get(pk=5)
        # Doing a lookup in the shard index table behind the scenes
        # to find the correct database for the Photo object with pk=5
    >>> p
    <Photo: #5>
    >>> p._shard
        # Every object knows where it lives
    <Shard 11>
    >>> p.photoset
        # Because of pooled_fields p passes along the shard 11
        # we know every related model has to be on shard 11
        # we only query shard 11 with that ID
    <Photoset: #1>
    >>> p.comments_set.all()
        # Same applies to reverse foreign keys
    [<PhotoComment: #1>,<PhotoComment: #2>,<PhotoComment:
#3>,<PhotoComment: #4>]

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Proposal: user-friendly API for multi-database support

Reply via email to