non-relational DB

Waldemar Kornewald Thu, 22 Oct 2009 04:36:46 -0700

Hi everyone,
this rather long mail contains a status report, instructions for
contributors, and implementation notes for Django core developers. If
you only want to know the status, you can stop after the first section.
If you want to contribute, I hope this provides a good starting point
for working on our port.

---------------------------------------
Status report

We've got pretty far with our App Engine port. For example, the
sessions db and cached_db backends both work unmodified on App Engine.
You can use basic filter()s as supported by the low-level App Engine
API (gt, gte, lt, lte, exact, pk__in). You can also use
QuerySet.order_by(), .delete(), and .count() as well as Model.save()
and Model.delete().
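
As a quick illustration of which filter keywords fall into the
supported set, here is a small stand-alone sketch. The helper
is_supported() is hypothetical and not part of the port; it only
encodes the lookup list given above and assumes that a bare field name
means an exact match, as it does in Django.

```python
# Hypothetical helper, not part of the port: checks whether a filter keyword
# uses one of the lookups listed in the status report.
SUPPORTED_LOOKUPS = {"gt", "gte", "lt", "lte", "exact"}


def is_supported(filter_key):
    """Return True if a filter keyword such as 'age__gte' should work."""
    field, _, lookup = filter_key.partition("__")
    if not lookup:
        lookup = "exact"  # Django treats a bare field name as an exact match
    if lookup == "in":
        return field == "pk"  # "in" is only listed for the primary key
    return lookup in SUPPORTED_LOOKUPS


for key in ("age__gte", "name", "pk__in", "name__icontains", "tags__in"):
    print(key, is_supported(key))
```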

This is our second porting attempt (it's not in the old repository).
Our first attempt had too many conflicts with the multi-db branch
(esp. the one on github). This time we just hacked everything
together. We didn't concentrate on cleaning up the current backend
API. We've also disabled SQL support.

The next step is to move all the hacks into a nice backend API (at the
same time making sure that it won't conflict with multi-db) and
re-enable SQL support. That's where we need help. Also, if you want to
work on SimpleDB support this is the right time to join. The App
Engine backend itself can be handled by Thomas Wanschik and me -
contributions in this area are not absolutely necessary, so please
concentrate on the cleanup if you want to help.

Now to the details (for those who want to contribute).

---------------------------------------
Introducing QueryGlue

The old Django code was distributed across three layers:
* django.db.models.queryset.QuerySet
* django.db.models.sql.query.Query (from now on just sql.Query)
* backend

When a new QuerySet is instantiated (e.g. by calling
Model.objects.all()) it asks the backend for its Query class and then
creates an instance of that class. By default, this class is
sql.Query. Only the Oracle backend has its own Query which subclasses
sql.Query.

Normally, sql.Query builds the query on-the-fly. Whenever you call
QuerySet.filter(<filters>) the filters get put into a Q(<filters>) and
passed to sql.Query.add_q(Q(...)). This function iterates over all
filter rules in the Q object and calls sql.Query.add_filter() for each
individual filter.
This in turn directly modifies sql.Query.where which is a tree
structure that represents the WHERE clause. It already contains
information about the JOIN type for each filter (INNER, OUTER), the
fields that get referenced by the filter, the column and table
aliases, and so on. It already does a lot of what we need for
non-relational backends, but it's too SQL-specific.
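
The call chain above can be mimicked with a toy model. This is only a
sketch: Q and ToySQLQuery below borrow the names but are not Django's
real classes, and the flat list stands in for the much richer where
tree.

```python
# Toy model of the filter() -> add_q() -> add_filter() flow. These classes
# only mimic the names; they are NOT Django's real implementations.
class Q(object):
    def __init__(self, **filters):
        # Sorted so the order of the rules is predictable in this demo.
        self.children = sorted(filters.items())


class ToySQLQuery(object):
    def __init__(self):
        # Stand-in for sql.Query.where; the real one is a tree that also
        # carries JOIN types and table/column aliases.
        self.where = []

    def add_q(self, q):
        # One add_filter() call per filter rule in the Q object.
        for lookup, value in q.children:
            self.add_filter((lookup, value))

    def add_filter(self, filter_expr):
        lookup, value = filter_expr
        field, _, op = lookup.partition("__")
        # The backend-specific representation is built immediately.
        self.where.append((field, op or "exact", value))


query = ToySQLQuery()
query.add_q(Q(age__gte=18, name="Ann"))
print(query.where)
```

The point of the toy is that the low-level representation exists as
soon as filter() is called, long before the query is executed.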

The current behavior is also a problem for multi-db because it makes
too many assumptions about the storage format of the filter rules. The
user could call QuerySet.using(other_connection) anytime, so QuerySet
shouldn't really work with the low-level sql.Query class before it
actually executes the query.

We've solved this problem by introducing a backend-independent query
representation between QuerySet and the low-level Query (sql.Query,
appengine.Query, etc.). This representation is called QueryGlue. You
can find it in django.db.models.queryglue. It provides almost exactly
the same "public" API as sql.Query (so it can easily be integrated
with QuerySet). Each filter() call gets translated into a tree
structure that is inspired by sql.Query.where, but it doesn't contain
any information about the kind of JOIN. Instead, it stores important
high-level information like whether we're filtering on a primary key,
which columns and tables are involved in a JOIN, etc.
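
To make the idea more concrete, here is a rough sketch of what a
backend-independent filter record could look like. FilterNode and
ToyQueryGlue are made-up names; the real class in
django.db.models.queryglue is structured differently, so treat this
purely as an illustration of "store high-level facts, not SQL".

```python
# Rough sketch only; the real class lives in django.db.models.queryglue and
# looks different. FilterNode and ToyQueryGlue are made-up names.
from collections import namedtuple

FilterNode = namedtuple("FilterNode", "field lookup value is_pk negated")


class ToyQueryGlue(object):
    def __init__(self, model_name, pk_name="id"):
        self.model_name = model_name
        self.pk_name = pk_name
        self.filters = []  # backend-independent filter records

    def add_filter(self, lookup, value, negated=False):
        field, _, op = lookup.partition("__")
        # Record the high-level fact "this is a primary key filter"
        # instead of anything about JOIN types or table aliases.
        is_pk = field in ("pk", self.pk_name)
        self.filters.append(
            FilterNode(field, op or "exact", value, is_pk, negated))


glue = ToyQueryGlue("Person")
glue.add_filter("pk__in", [1, 2, 3])
glue.add_filter("age__lt", 30)
for node in glue.filters:
    print(node)
```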

---------------------------------------
The low-level Query class

Once the query needs to be executed (e.g., by calling .count() or by
iterating over the query) the QueryGlue instance creates a new
low-level Query instance which gets the QueryGlue as its only
parameter. Currently, the low-level Query class is hard-coded to
GAEQuery/BaseQuery in django.db.models.nonrelational.query.

Then, QueryGlue calls the respective execution function
(results_iter(), count(), etc.) on the instantiated low-level Query.
Our GAEQuery can now iterate over all filters in QueryGlue.filters and
convert them to an App Engine Query object.
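
The handoff can be sketched with two toy classes. Nothing below is
taken from the port: ToyQueryGlue and ToyBackendQuery are invented, a
plain list of dicts stands in for the datastore, and only three
lookups are handled.

```python
# Made-up classes that mirror the handoff: the glue object creates the
# low-level query only when results are actually requested.
class ToyBackendQuery(object):
    OPS = {
        "exact": lambda a, b: a == b,
        "gt": lambda a, b: a > b,
        "lt": lambda a, b: a < b,
    }

    def __init__(self, glue):
        # The constructor receives the glue object and nothing else.
        self.glue = glue

    def _matches(self, row):
        return all(self.OPS[op](row[field], value)
                   for field, op, value in self.glue.filters)

    def results_iter(self):
        return (row for row in self.glue.rows if self._matches(row))

    def count(self):
        return sum(1 for _ in self.results_iter())


class ToyQueryGlue(object):
    query_class = ToyBackendQuery  # hard-coded here, like in the port

    def __init__(self, rows):
        self.rows = rows    # stand-in for the datastore
        self.filters = []   # (field, op, value) triples

    def add_filter(self, field, op, value):
        self.filters.append((field, op, value))

    def results_iter(self):
        return self.query_class(self).results_iter()

    def count(self):
        return self.query_class(self).count()


glue = ToyQueryGlue([{"age": 17}, {"age": 30}, {"age": 45}])
glue.add_filter("age", "gt", 18)
print(glue.count())
```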

---------------------------------------
Subqueries

Instead of working with subquery classes we've added delete_bulk(),
insert(), etc. directly to QueryGlue and the low-level Query class. If
sql.Query really needs the current design those functions can still be
routed to the respective subquery instance, but on App Engine it's
easier to handle those operations in a separate function.
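
A sketch of how both styles could sit behind the same method: the
caller only ever calls insert(), and a SQL-style class may still route
that call to a subquery object internally. All class names here are
invented for the example and the "datastore" is a plain list.

```python
# Illustration only: one insert() entry point, two ways to implement it.
class ToyInsertQuery(object):
    """Stand-in for a specialized subquery class."""
    def __init__(self, log):
        self.log = log

    def insert_values(self, values):
        self.log.append(("INSERT", dict(values)))


class ToySQLQuery(object):
    """Keeps the subquery design, but hides it behind insert()."""
    def __init__(self):
        self.log = []

    def insert(self, values):
        ToyInsertQuery(self.log).insert_values(values)


class ToyDatastoreQuery(object):
    """Handles the operation directly in a plain method."""
    def __init__(self):
        self.entities = []

    def insert(self, values):
        self.entities.append(dict(values))


def save(query, values):
    # The caller does not care which style the backend uses.
    query.insert(values)


sql_query = ToySQLQuery()
datastore_query = ToyDatastoreQuery()
save(sql_query, {"name": "Ann"})
save(datastore_query, {"name": "Ann"})
print(sql_query.log, datastore_query.entities)
```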

---------------------------------------
The cleanup

We made a few not-so-clean changes to Django itself. I've attached a
diff, so contributors can easily find all the changes we made to Django
(they're also commented with TODO and GAE):

............................
* disabled multi-table inheritance;
this could be emulated as described on the Django wiki
http://code.djangoproject.com/wiki/NonSqlBackends

See
django/db/models/base.py: line 147

............................
* disabled deletion of related objects in Model.delete() and QuerySet.delete()

See
django/db/models/query.py: lines 1036, 1065

............................
* replaced sql.subqueries.*Query usage with simple functions on a
single Query class (insert_or_update() instead of InsertQuery and
UpdateQuery)

See
django/db/models/query.py: lines 1058, 1088

............................
* commented out distinction between insert and update in
Model.save_base() because there's no such concept in App Engine (and
SimpleDB, AFAIK)

See
django/db/models/base.py: lines 470, 475

............................
The long-term goal is of course to clean this up and move most of
these changes into the backend API.
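
The insert/update point is easiest to see with a toy example. The
sketch below assumes "put" semantics, where writing an entity under an
existing key simply overwrites it. The function is a stand-in that
operates on a dict, not the insert_or_update() from our port, and the
id allocation is deliberately naive.

```python
# Sketch of "put" semantics with a dict as a stand-in datastore.
# This toy insert_or_update() is not the function from the port.
def insert_or_update(store, values):
    """Write an entity; an existing key is overwritten without any check."""
    pk = values.get("id")
    if pk is None:
        # Naive id allocation, good enough for the demonstration.
        pk = max(store, default=0) + 1
    store[pk] = dict(values, id=pk)
    return pk


store = {}
first_pk = insert_or_update(store, {"name": "Ann"})
# Same call for an "update": no existence check, no separate code path.
insert_or_update(store, {"id": first_pk, "name": "Anna"})
print(store)
```

With this behavior there is nothing for Model.save_base() to decide,
which is why the record_exists check is bypassed in the diff.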

---------------------------------------
Common non-relational features

The plan is to add support for simple joins and select_related to all
non-relational backends, either by subclassing the backend's Query
class on-the-fly with a JoinQuery or by supporting something like query
pre-processors which can be added above the low-level Query class. We
haven't thought about the details yet, but I hope you get the idea.
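
Just to show the direction, here is one possible shape for a
pre-processor that emulates select_related() with a second batch
lookup. This is speculation, not a design decision: the function name,
the "_cache" suffix, and the fetch callback are all invented for the
example.

```python
# One possible shape for a query pre-processor that emulates
# select_related(). Every name here is invented for illustration.
def emulate_select_related(rows, fk_field, fetch_by_pks):
    """Attach related entities to rows using one extra batch lookup."""
    pks = sorted(set(row[fk_field] for row in rows
                     if row[fk_field] is not None))
    related = fetch_by_pks(pks)  # expected to return {pk: entity}
    return [dict(row, **{fk_field + "_cache": related.get(row[fk_field])})
            for row in rows]


authors = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}}
books = [{"title": "A", "author": 1},
         {"title": "B", "author": 2},
         {"title": "C", "author": None}]


def fetch_authors(pks):
    # Stands in for a single pk__in query against the datastore.
    return dict((pk, authors[pk]) for pk in pks)


for book in emulate_select_related(books, "author", fetch_authors):
    print(book["title"], book["author_cache"])
```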

---------------------------------------
SQL layer details

The ugly detail is that sql.subqueries contains specialized query
classes like InsertQuery, DeleteQuery, etc. which subclass the
backend's Query class. This means that currently, the module loading
process jumps around:
* sql/__init__.py imports sql.query and then sql.subqueries
* sql.query creates the base Query class
* after that, sql.query allows the backend to override the Query class
* sql.subqueries creates subclasses which derive from Query

In multi-db in SVN this is uglier because the subquery classes no
longer have a single sql.Query base class from which to derive. There
can be multiple backends, each with its own sql.Query class, so the
subqueries have to be maintained by the backend (with some
multi-inheritance magic and manual caching of the custom subclasses).
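
For readers who haven't seen this pattern, the "multi-inheritance magic
and manual caching" can be sketched roughly like this. It is not the
code from the SVN branch; the backend classes, the mixin, and
get_subquery_class() are hypothetical names used only to show the
technique.

```python
# Sketch of per-backend subquery classes built on demand and cached.
# All names are hypothetical; this is not the code from the SVN branch.
_subquery_cache = {}


class DeleteMixin(object):
    def as_sql(self):
        return "DELETE FROM %s" % self.quote("person")


class ToyPostgresQuery(object):
    def quote(self, name):
        return '"%s"' % name


class ToyMySQLQuery(object):
    def quote(self, name):
        return "`%s`" % name


def get_subquery_class(backend_query_class, mixin):
    """Combine a mixin with a backend's Query class, once per pair."""
    key = (backend_query_class, mixin)
    if key not in _subquery_cache:
        name = backend_query_class.__name__ + mixin.__name__
        # type() creates the subclass at runtime via multiple inheritance.
        _subquery_cache[key] = type(name, (mixin, backend_query_class), {})
    return _subquery_cache[key]


mysql_delete = get_subquery_class(ToyMySQLQuery, DeleteMixin)
postgres_delete = get_subquery_class(ToyPostgresQuery, DeleteMixin)
print(mysql_delete().as_sql())
print(postgres_delete().as_sql())
```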

In multi-db on github this is much cleaner: The backends can no longer
override sql.Query. Instead, there's an SQLCompiler class
which can be overridden by the backend to take care of
backend-specific details. sql.Query stores a slightly more abstract
representation of the query. This multi-db branch moves a lot of code
around. That's why we should try to keep as much code as possible
where it is (at least until the branch gets merged into trunk).
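
The compiler idea can be illustrated with a toy. The classes below are
not the ones from the github branch; they only show the split between
a backend-neutral query description and a backend-specific compiler,
using row limiting (LIMIT versus Oracle's ROWNUM) as the example.

```python
# Toy version of the query/compiler split. These are not the classes from
# the multi-db branch; they only demonstrate the pattern.
class ToyQuery(object):
    """Backend-neutral description of a query."""
    def __init__(self, table, limit=None):
        self.table = table
        self.limit = limit


class ToyCompiler(object):
    """Default compiler: turns the description into SQL."""
    def __init__(self, query):
        self.query = query

    def as_sql(self):
        sql = "SELECT * FROM %s" % self.query.table
        if self.query.limit is not None:
            sql += " LIMIT %d" % self.query.limit
        return sql


class ToyOracleCompiler(ToyCompiler):
    """A backend overrides the compiler, not the query class."""
    def as_sql(self):
        sql = "SELECT * FROM %s" % self.query.table
        if self.query.limit is not None:
            sql = "SELECT * FROM (%s) WHERE ROWNUM <= %d" % (
                sql, self.query.limit)
        return sql


query = ToyQuery("person", limit=10)
print(ToyCompiler(query).as_sql())
print(ToyOracleCompiler(query).as_sql())
```

The same ToyQuery instance is compiled twice, which is what makes this
design a good fit for switching connections late.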

---------------------------------------
The source

The test project and our unit tests are here:
http://bitbucket.org/wkornewald/django-testapp/

The modified Django source and the backend is here:
http://bitbucket.org/wkornewald/django-nonrel-hacked/

We've patched the trunk branch. Unfortunately, the branches are
unnamed (I converted the git mirror because the hg mirror's branches
on bitbucket are broken). You should be able to find the right branch
with "hg heads" and switch to it with "hg up -C". Normally our branch
should be at tip anyway, so you don't need to do anything.

When merging you need to find the trunk branch with "hg heads" and run
"hg merge <revnum>" with the trunk head. If this becomes a huge problem
we'll switch to the django-trunk mirror, but I wanted to keep the
option to switch to Alex's multidb branch if that's better, so I chose
this sub-optimal Django mirroring solution.

---------------------------------------
Task management

Our tasks are managed in a Google Spreadsheet:
https://spreadsheets.google.com/ccc?key=0AnLqunL-SCJJdE1fM0NzY1JQTXJuZGdEa0huODVfRHc&hl=en

The task list isn't complete yet. We're working on that.

Bye,
Waldemar Kornewald


diff -r 6e733173d200 django/db/models/base.py
--- a/django/db/models/base.py	Sat Oct 17 17:32:25 2009 +0000
+++ b/django/db/models/base.py	Thu Oct 22 11:28:17 2009 +0200
@@ -143,6 +143,11 @@
                                         (field.name, name, base.__name__))
             if not base._meta.abstract:
                 # Concrete classes...
+
+                # TODO: GAE: use polymodel instead
+                if True:
+                    raise TypeError("Multi-table inheritance isn't yet supported on App Engine")
+
                 while base._meta.proxy:
                     # Skip over a proxy class to the "real" base it proxies.
                     base = base._meta.proxy_for_model
@@ -462,20 +467,23 @@
             # First, try an UPDATE. If that doesn't update anything, do an INSERT.
             pk_val = self._get_pk_val(meta)
             pk_set = pk_val is not None
-            record_exists = True
+            # TODO: GAE: Clean up. Setting record_exists to False fakes that we don't
+            # distinguish between insert and update.
+#            record_exists = True
+            record_exists = False
             manager = cls._base_manager
-            if pk_set:
-                # Determine whether a record with the primary key already exists.
-                if (force_update or (not force_insert and
-                        manager.filter(pk=pk_val).extra(select={'a': 1}).values('a').order_by())):
-                    # It does already exist, so do an UPDATE.
-                    if force_update or non_pks:
-                        values = [(f, None, (raw and getattr(self, f.attname) or f.pre_save(self, False))) for f in non_pks]
-                        rows = manager.filter(pk=pk_val)._update(values)
-                        if force_update and not rows:
-                            raise DatabaseError("Forced update did not affect any rows.")
-                else:
-                    record_exists = False
+#            if pk_set:
+#                # Determine whether a record with the primary key already exists.
+#                if (force_update or (not force_insert and
+#                        manager.filter(pk=pk_val).extra(select={'a': 1}).values('a').order_by())):
+#                    # It does already exist, so do an UPDATE.
+#                    if force_update or non_pks:
+#                        values = [(f, None, (raw and getattr(self, f.attname) or f.pre_save(self, False))) for f in non_pks]
+#                        rows = manager.filter(pk=pk_val)._update(values)
+#                        if force_update and not rows:
+#                            raise DatabaseError("Forced update did not affect any rows.")
+#                else:
+#                    record_exists = False
             if not pk_set or not record_exists:
                 if not pk_set:
                     if force_update:
@@ -519,6 +527,8 @@
         pk_val = self._get_pk_val()
         if seen_objs.add(self.__class__, pk_val, self, parent, nullable):
             return
+        # TODO: GAE support deleting related objects in background task
+        return
 
         for related in self._meta.get_all_related_objects():
             rel_opts_name = related.get_accessor_name()
diff -r 6e733173d200 django/db/models/query.py
--- a/django/db/models/query.py	Sat Oct 17 17:32:25 2009 +0000
+++ b/django/db/models/query.py	Thu Oct 22 11:28:17 2009 +0200
@@ -14,6 +14,7 @@
 from django.db.models.fields import DateField
 from django.db.models.query_utils import Q, select_related_descend, CollectedObjects, CyclicDependency, deferred_class_factory
 from django.db.models import signals, sql
+from django.db.models.queryglue import QueryGlue
 
 
 # Used to control how many objects are worked with at once in some cases (e.g.
@@ -33,7 +34,7 @@
     """
     def __init__(self, model=None, query=None):
         self.model = model
-        self.query = query or sql.Query(self.model, connection)
+        self.query = query or QueryGlue(self.model, connection)
         self._result_cache = None
         self._iter = None
         self._sticky_filter = False
@@ -1032,20 +1033,21 @@
             for pk_val, instance in items:
                 signals.pre_delete.send(sender=cls, instance=instance)
 
-            pk_list = [pk for pk,instance in items]
-            del_query = sql.DeleteQuery(cls, connection)
-            del_query.delete_batch_related(pk_list)
-
-            update_query = sql.UpdateQuery(cls, connection)
-            for field, model in cls._meta.get_fields_with_model():
-                if (field.rel and field.null and field.rel.to in seen_objs and
-                        filter(lambda f: f.column == field.rel.get_related_field().column,
-                        field.rel.to._meta.fields)):
-                    if model:
-                        sql.UpdateQuery(model, connection).clear_related(field,
-                                pk_list)
-                    else:
-                        update_query.clear_related(field, pk_list)
+            # TODO: GAE: do this in a background task
+#            pk_list = [pk for pk,instance in items]
+#            del_query = sql.DeleteQuery(cls, connection)
+#            del_query.delete_batch_related(pk_list)
+#
+#            update_query = sql.UpdateQuery(cls, connection)
+#            for field, model in cls._meta.get_fields_with_model():
+#                if (field.rel and field.null and field.rel.to in seen_objs and
+#                        filter(lambda f: f.column == field.rel.get_related_field().column,
+#                        field.rel.to._meta.fields)):
+#                    if model:
+#                        sql.UpdateQuery(model, connection).clear_related(field,
+#                                pk_list)
+#                    else:
+#                        update_query.clear_related(field, pk_list)
 
         # Now delete the actual data.
         for cls in ordered_classes:
@@ -1053,16 +1055,17 @@
             items.reverse()
 
             pk_list = [pk for pk,instance in items]
-            del_query = sql.DeleteQuery(cls, connection)
+            del_query = QueryGlue(cls, connection)
             del_query.delete_batch(pk_list)
 
             # Last cleanup; set NULLs where there once was a reference to the
             # object, NULL the primary key of the found objects, and perform
             # post-notification.
             for pk_val, instance in items:
-                for field in cls._meta.fields:
-                    if field.rel and field.null and field.rel.to in seen_objs:
-                        setattr(instance, field.attname, None)
+                # TODO: GAE: do this in a background task
+#                for field in cls._meta.fields:
+#                    if field.rel and field.null and field.rel.to in seen_objs:
+#                        setattr(instance, field.attname, None)
 
                 signals.post_delete.send(sender=cls, instance=instance)
                 setattr(instance, cls._meta.pk.attname, None)
@@ -1082,6 +1085,5 @@
     the InsertQuery class and is how Model.save() is implemented. It is not
     part of the public API.
     """
-    query = sql.InsertQuery(model, connection)
-    query.insert_values(values, raw_values)
-    return query.execute_sql(return_id)
+    query = QueryGlue(model, connection)
+    return query.insert(values, raw_values, return_id)
