Proposal: use SQLAlchemy Core for query generation

Luke Plant Sat, 30 Jun 2012 07:22:26 -0700

Hi all,

A good while back I put forward the idea of using SQLAlchemy Core in
Django [1]. Having had more experience working with SQLAlchemy, I'm
putting that idea forward as a formal proposal, as I mentioned in a more
recent thread here.


Apologies in advance for the length! I've included a few 'TL;DR'
summaries and headings in the different sections which you might want to
scan first.

=== Proposal ===

We should re-write our query generation code to use SQLAlchemy Core.
This includes DDL queries as well as DML (e.g. CREATE TABLE as well as
SELECT, INSERT etc).

This would also involve replacing our database connection objects with
SQLAlchemy's. In this proposal, our high-level ORM, with model
definition and query API, would remain the same - we wouldn't be using
the SQLAlchemy ORM.

This is a "Django 2.0" proposal i.e. not immediate future, and not fully
backwards compatible.

=== Background - Django 2.0 philosophy ===

TL;DR: "evolution not revolution, some backwards incompatibilities"

Although we haven't really discussed the timing of Django 2.0, or its
philosophy, I think we should be doing. My own assumption is that it
should be evolution, not revolution, similar to Python 3, where we break
stuff that has dogged us in the past, and will hamper us going forward,
but don't make radical changes where not necessary. This proposal fits
that basic assumption.

Also, in Django to date we've eschewed external dependencies. That has
been partly due to the poor and confusing state of Python packaging,
which is hopefully improving. I think it will make us a very poor Python
citizen if we don't reverse this policy at some point, and Django 2.0 is
an obvious point to do it.

This proposal does not represent a move away from being a 'full stack'
framework. "Full stack" does not mean "no dependencies".

Our current recommended installation method is via pip [2], and we would
be doing our users a disservice to recommend otherwise. Installation via
pip actually makes the instructions shorter - manual methods require
things like removing old versions manually - and for anything beyond
trivial use of Django you have to know pip anyway.

So, with our recommended installation method, adding a dependency
doesn't make things any more difficult at all.

=== Background - ORM philosophy ===

TL;DR: "Let's make Django's DB layer the best it can be for relational
databases."

Whether we call it "the ORM" or "the model layer" as some people prefer,
I think it's fairly certain that the overwhelming majority of our users
are using relational databases.

Many of the things that make Django a compelling choice,
including the admin and re-usable apps, either don't work or are of
little use if you are not using a relational database.

So my philosophy is that we should aim to provide a really excellent
ORM that will take users as far as possible.

This doesn't preclude having non-relational support in Django. But
it seems very strange to make that the focus, when we've had
little-to-no support for it so far, or to allow that support to limit
how well we can cater for the 99%.

=== Motivation ===

== Motivation 1: Django's ORM leaves you high and dry when you reach its
limits.

While the ORM can do a surprising number of queries, there are plenty it
can't, and in all the medium-to-large projects I've worked on I've gone
beyond what the ORM can do.

At this point, you've got a few options, from easiest to hardest:

A) Do the aggregation/filtering etc in Python.

B) Write raw SQL.

C) Use SQLAlchemy or some other SQL generation tool.

D) Write a patch to extend the ORM.


None of these is great:

A) Data manipulation in Python, when it could be done in SQL, is
obviously a bad idea since it is usually very inefficient. But I've seen
a lot of code that does this, because it was hard/impossible to get
Django's ORM to do the query needed.

This anti-pattern will also give Django applications the reputation for
being slow. Obviously, we can point the finger at the developer, but if
we've made it hard for the developer to do the right thing, that is unfair.

B) Raw SQL fails if you have dynamic queries i.e. where the shape of the
query can vary.

Example 1: you are writing library code e.g. a re-usable app that knows
nothing about the tables it is actually querying, and may have been
given any arbitrary QuerySet as an input to manipulate in some way.

Example 2: even if you have full knowledge of the tables, you might have
additional WHERE clauses/JOINs/sub queries in some cases, that you want
to programmatically add to the query.

I've come across both these types in projects I've been involved in, and
I know I'm far from the only one.

Raw SQL can also fail if you are manually writing 'static' queries but
need compatibility with multiple DB backends.

C) For SQL generation, SQLAlchemy is the best, but for good reason it
comes with its own database connection objects. Having two sources of
connection objects causes problems, such as queries not seeing what was
done in the transaction established by the other connection.

This can be especially painful for tests even if you're not making much
use of transactions in your live system, since it leads to needing the
much slower TransactionTestCase and hacks to make it work.

D) This is basically too hard for most of us, including core developers.
One reason is that the the code involved is not in the best shape. (More
below. This might change if Anssi's work succeeds, but I still have big
doubts about how far we can push our ORM).


So, for these reasons, when you reach the end of what the ORM can
provide, you are stuck with a lot of poor or painful options. It sucks
to try to explain to a customer that although the feature they want is
relatively straightforward, and easily supportable by the major
component (the database), the limitations of the tool you are using make
it impractical. It *really* sucks when the tool in question is Django.

If SQLAlchemy was used for our query generation, however, it would be
easy to get a query object for 'the query so far' and start from there,
using the full power of SQLAlchemy Expression Language to build whatever
query you need.

== Motivation 2: More db backends

SQLAlchemy has a load of code to cope with different DB dialects, and we
have some too. This is largely overlapping, which is duplicated work.
It's not entirely overlapping though, and there are some people who
can't use one of Django/SQLAlchemy due to choice of DB. If we combined
work, both projects would benefit - SQLAlchemy would get a large number
of people now interested in fixing whatever backend support is missing
in their code, and we'd get all their existing backends for free.

Mike Bayer is enthusiastic about this possibility [3].

== Motivation 3: Code cleanup

As mentioned, our own query generation has been pushed to the limit and
beyond. (I'm talking about classes like Query, SQLCompiler). It has
grown and grown, so that the Query class is now a 2000 line behemoth,
containing a constructor with over 40 assignments to 'self'.

Most of the core developers are scared to touch this stuff, AFAICS,
myself included. It has no virtually no unit tests. While it has very
high test coverage, it is tested only by tests that check QuerySet and
other high-level APIs.

As such, it's very difficult to change, and it may well be beyond our
ability to successfully refactor.

Switching to SQLAlchemy would force us to rewrite this code, which is
for our own good. In addition, large chunks of it can be dropped
entirely (i.e. most of database specific stuff). This will reduce our
maintenance load going forward (eventually).

(BTW, I'm not saying that we should let the existing code continue to
rot, we should of course try to clean it up as best we can, and that
effort is not wasted - I'm talking about a longer term strategy here.
If we can refactor this code, great - this motivation can be dropped
from the list, but I think the others still stand).

== Motivation 4: Python community

Having another project as a dependency would be a good thing. It would
prevent Django becoming a walled garden, and give us greater exposure to
the broader talent in Python-world. SQLAlchemy itself is a fantastic
library, from which we can learn a lot.

=== Cost/risk for us ===

There is no doubt this would require a lot of work, but I think it is
achievable. Both SQLAlchemy and Django have connection objects that
'mostly' implement the DBAPI standard, which will help with the database
connection stuff.

Also, *not* doing this may represent a greater cost for us and our users
in the long run. There is no doubt that this layer of our code
represents a large amount of technical debt that we need to deal with at
some point.

We may be able to re-use a lot of the work that has already been done in
django-sqlalchemy [4], which is similar to this proposal in some ways,
or at least has done some of mapping.

It's possible that we won't be able to remove as much code as I'm
suggesting, leading to us having our own database-compatibility layer
on top of SQLA's, which might make our code ugly and harder to maintain
than doing it all ourselves.

For example, Oracle doesn't support LIMIT/OFFSET, and we have our own
workaround. SQLA also has a workaround that works transparently (I think
it is basically the same SQL), so we can delete all our code for this.
However, if there is a workaround we've got that SQLA doesn't have, or
doesn't apply transparently, or works differently to ours in an
important way, we might still need our own 'dialect' classes.  It may be
possible to get SQLA to accept patches for this in some cases, or use
SQLA's extension points to modify it's behaviour e.g. [5]. Mike Bayer
would definitely work with us on this (I've emailed him about it).

Finally, this change would probably require us to start helping out with
SQLAlchemy, at the very least investing enough to be able to effectively
contribute patches etc.


=== Implementation ===

(You can skip this section, it's just rough ideas I've come up with in
trying to assess the amount of work needed).

There are different strategies I can think of to break it down into
steps that will leave Django roughly working after each major step. Here
are some:

== Strategy 1:

The most direct, and in some ways the simplest, but requires a few huge
jumps.

1) Replace all our connection objects with SQLAlchemy's. Each individual
backend (sqlite, postgres etc) would return SQLA connection objects.
All SQL generation would still be done in Django.

2) Replace DDL SQL generation with SQLA Core.

3) Replace DML SQL generation with SQLA Core.

4) Replace backend/*/introspection.py

== Strategy 2:

1) Rewrite one of our backends (probably SQLite) so that it uses SQLA to
connect to SQLite. SQLite is a good choice because we tend to use it for
running the test suite fast, but not for production, maximising the
number of people who might be willing to try out this branch.

2) Rewrite its DDL SQL generation so it uses SQLA Expression Language.

3) Write a SQLCompiler class for SQLite, in the same way that we have a
MySQL/Oracle SQLCompiler, but rewriting *all* the methods, to use SQLA
Expression Language to build the queries.

4) Allow backends to provide their own QuerySet, and give SQLite its
own. The implementation delegates to a new Query class that uses SQLA to
build query objects.

5) One-by-one, migrate all the other backends so that they inherit from
the SQLite backend and classes, added fixes in subclasses where
necessary. Move the fixes down into the backend where possible.

6) Eliminate the whole 'backend' concept, pulling up the SQLite
implementation as the only implementation, renaming it etc, and removing
our own connection wrappers.

(Steps 3 and 4 might need to be combined/reversed).

== Strategy 3:

Same as strategy 2 except start by creating a 'SQLAlchemy' backend
(which can connect to any of the DBs via configuration). It might be
more confusing as it involves an extra layer of indirection. However, it
might be a better strategy, because you could actually develop this in a
branch that can be used in production, because just by a config change
you can use the new method. You could even merge this branch into master
regularly, which would encourage development.

Also, in this strategy, the first step - a SQLAlchemy backend that still
generates all queries the old way but has SQLAlchemy connection objects
- would be useful by itself. It would solve the problem of transactions
and a slow test suite for the case of a mixed Django/SQLA code base (see
above).

When the SQLAlchemy backend can fully pass the test suite for all
different DBs, and is using all SLQAlchemy stuff for SQL generation, it
will become the only backend, and the extra layer of indirection can be
removed, along with all the old code we are no longer using.


In all strategies, I haven't really thought about the contrib.gis and
the backends needed there. There are some SQLAlchemy libraries with GIS
extensions - e.g. GeoAlchemy [6]


=== Negative impact for users ===

There would be backwards incompatibilities for anyone using internals,
although quite a lot of them could be kept to a minimum with some shims
(probably temporary shims, which would be removed eventually). Some
documented APIs may be problematic e.g. QuerySet.extra(). These already
have various bugs, and it is possible that the re-write may expose the
bugs as being symptoms of fundamental flaws, rather than things we can
fix. We probably want to drop extra() for 2.0 anyway.

The level of incompatibilities, and the need for external dependency,
make me think that this is best suited for Django 2.0,


=== Further work or alternatives ===

This may or may not lead to adopting the SQLAlchemy ORM. This proposal
makes sense with or without that. That would be a much bigger change,
because it would change a lot publicly documented and highly used APIs,
and would introduce more subtle semantic changes due to things like
SQLA's Unit-of-Work etc.

I've emailed Mike Bayer about this proposal, and he is certainly happy
with the idea in general. He would want us to consider actually an
alternative, which is to use the SQLAlchemy ORM, not just Core, with
some backwards compatible wrappers for things like QuerySet. I wouldn't
be against that proposal, as I've heard great things about the SQLA ORM,
but I'm not in a position to recommend it myself, and it does introduce
some major semantic changes, even with a compatibility shim for syntax.

Another change this might lead to is getting rid of Q objects and F
objects, which are a poor man's way of referring to chunks of SQL, and
can at least be better implemented using SQLA's Expression Language, and
possibly removed altogether.

Finally, I should say thanks to Mike Bayer for the time he took just
looking at this proposal, and for his feedback, some of which I've
incorporated into this.


Comments? It would be good to get other developers' comments on the
whole idea of "Django 2.0" as well - when do we see it happening?


Regards,

Luke


[1] http://goo.gl/uKQe9
[2] http://goo.gl/gJP15
[3] http://goo.gl/NVRdw
[4] http://gitorious.org/django-sqlalchemy/
[5] http://goo.gl/4Vhlb
[6] http://geoalchemy.org/index.html

-- 
"Outside of a dog, a book is a man's best friend... inside of a
dog, it's too dark to read."

Luke Plant || http://lukeplant.me.uk/

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.

Proposal: use SQLAlchemy Core for query generation

Reply via email to