GSoC Proposal: Serialization Enhancements

Russ Sun, 29 Mar 2009 15:38:16 -0700

My apologies for the length!


Concisely, I intend to provide the Django user with some granular
control of the data to be serialized without sacrificing backwards
compatibility for old code, or for users who need the straightforward,
current functionality.


Any serialized Model contains only the data in the source row of its
database table.  For inheritance, this is only an issue when the
model in question derives from one or more concrete models; in that
case, any data held on the far end of the one-to-one relation is not
serialized at all.  The overarching issue is rooted in the way Django
serializers treat relationships.  This is clearly a complex issue,
because both shallow and deep serialization are useful in a large
range of situations.

For the purpose of argument and demonstration, please examine the
following use case.  A business wishes to begin providing their
products online, but they wish to continue using the inventory
management software with which they are familiar
(henceforth, ezInventory).  The models.py looks like:

from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    # We're assuming all prices are in USD...
    price = models.DecimalField(max_digits=6, decimal_places=2)

class Order(models.Model):
    products = models.ManyToManyField(Product)
    order_placed = models.DateTimeField()

    def total_price(self):
        # aggregate() returns a dict keyed by '<field>__sum'
        return self.products.aggregate(models.Sum('price'))['price__sum']

Django's serialization facilities make importing and exporting
products and orders between their website and ezInventory a breeze.
Six months later, however, they have a feature request.

They would like to begin providing support for some of their products
via subscription.  This is an easy addition:

# imports...
# original models...

class Subscription(Product):
    recurrence = models.PositiveIntegerField(
        choices=((12, 'annual'), (1, 'monthly')))

Immediately, a problem is evident: the business relies on easy
communication of data between ezInventory and the website, but Django
serializes the class Subscription as follows:

[{
    "pk": 13,
    "model": "app.subscription",
    "fields": {
        "recurrence": 12
    }
}]

Now, this is wonderful, as long as we know to pair this information
with the product with id 13.  However, ezInventory, though supportive
of subscriptions in general, merely overwrites the product with id 13
(assuming the products table was imported first) and complains about
the missing fields.
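
To make the pairing step concrete, here is a minimal sketch of what a
consumer of today's output has to do by hand.  It works on plain
dictionaries shaped like the JSON above; merge_child_with_parent is a
hypothetical helper for illustration, not part of Django, and it
assumes the child row shares its pk with the parent row (as it does
under multi-table inheritance).

```python
def merge_child_with_parent(records, child_model, parent_model):
    """Fold each parent's fields into the child record that shares its pk."""
    parents = {}
    for record in records:
        if record["model"] == parent_model:
            parents[record["pk"]] = record["fields"]

    merged = []
    for record in records:
        if record["model"] != child_model:
            continue
        fields = dict(parents.get(record["pk"], {}))  # parent fields first
        fields.update(record["fields"])               # child fields win on clashes
        merged.append(
            {"pk": record["pk"], "model": record["model"], "fields": fields}
        )
    return merged


# Sample data in the shape of the current serializer output.
dump = [
    {"pk": 13, "model": "app.product",
     "fields": {"name": "Subscription #1", "price": "14.98"}},
    {"pk": 13, "model": "app.subscription",
     "fields": {"recurrence": 12}},
]
merged = merge_child_with_parent(dump, "app.subscription", "app.product")
```

Every external tool that reads the export would need its own version
of this step, which is the burden the proposal aims to remove.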


Skipping merrily back to the real world, here is the issue, plain and
simple.  Sometimes, you need to include more information than is just
in the one model.  Sometimes, that's going to appear in the form of
inherited models, sometimes via one-to-many or many-to-many
relationships, and sometimes just little relevant bits of information
that may not actually be in the model at all.  There are numerous open
tickets about this issue [1][2][3][4].  Clearly, this needs some
thought.

I like the idea [5] of providing a Serializer class, defined similarly
to the ModelAdmin class, to allow custom ways of serializing data,
something like:

# in serializers.py (or something)
class ProductListing(serializers.Serializer):
    fields = {
        'Product': ['price', 'name', 'description'],
        'Subscription': ['price', 'name', 'description', 'recurrence']
    }

serializers.register(ProductListing)

$ python manage.py shell
>>> from project.app import models
>>> from django.core import serializers
>>> print serializers.serialize('json', list(models.Product.objects.all()) +
...     list(models.Subscription.objects.all()),
...     serializer='ProductListing', indent=4)
[{
    "pk": 1,
    "model": "app.product",
    "fields": {
        "price": 9.98,
        "name": "Product #1",
        "description": "Description for Product #1"
    }
},
# other products...
{
    "pk": 13,
    "model": "app.subscription",
    "fields": {
        "price": 14.98,
        "name": "Subscription #1, Product #13",
        "description": "My demonstrations aren't particularly creative.",
        "recurrence": 12
    }
}]

Not only is this backwards-compatible (no new serializers == no new
behavior), but it also continues to allow deserialization: The
deserialization process would see that app.subscription is a child of
product, and fill in the data appropriately.  This circumvents one of
the largest drawbacks, IMO, of DjangoFullSerializers [6]; being able to
deserialize this data is often just as important as serializing it
(lookin' at you, fixtures).  The observant reader will note that, as
thus demonstrated, it only solves this issue for models using
inheritance, not those using one-to-many or many-to-many fields.  So,
let's serialize some orders.

# in serializers.py (or something)
class OrderSerializer(serializers.Serializer):
    fields = {
        # include values from the products, and the return value of
        # total_price
        'order': [{'products': ['price', 'name']}, 'total_price']
    }

serializers.register(OrderSerializer)

$ python manage.py shell
>>> from project.app import models
>>> from django.core import serializers
>>> print serializers.serialize('json', models.Order.objects.filter(pk=1),
...     serializer='OrderSerializer', indent=4)
[{
    "pk": 1,
    "model": "app.order",
    "fields": {
        "products": [{
            "pk": 1,
            "model": "app.product",
            "fields": {
                "price": 9.98,
                "name": "Product #1"
            }
        },
        {
            "pk": 13,
            "model": "app.subscription",
            "fields": {
                "price": 14.98,
                "name": "Subscription #1, Product #13"
            }
        }],
        "total_price": 24.96
    }
}]

As long as enough data is presented, the deserializer stands no less
chance of accurately connecting models together than in its current
form.  Arguments could be provided to search for matching rows and
correct the primary keys from serialized data, or to update the fields
in the row of given primary key.
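
One way the nested fields list for an order could be interpreted is
sketched below.  This is only an illustration under stated
assumptions: it uses plain Python stand-ins for the models rather
than Django, and serialize_fields is a hypothetical name.  A string
entry is read as an attribute (and called if it is a method); a
dictionary entry maps a relation name to the sub-fields to emit for
each related object.

```python
from decimal import Decimal


class Product(object):
    """Plain stand-in for the Product model."""
    def __init__(self, pk, name, price):
        self.pk, self.name, self.price = pk, name, price


class Order(object):
    """Plain stand-in for the Order model; products is a simple list."""
    def __init__(self, pk, products):
        self.pk, self.products = pk, products

    def total_price(self):
        return sum(p.price for p in self.products)


def serialize_fields(obj, spec):
    """Build the 'fields' dictionary for obj from a fields list."""
    out = {}
    for entry in spec:
        if isinstance(entry, dict):
            # {'relation': [sub-fields]}: recurse into each related object
            for relation, subspec in entry.items():
                out[relation] = [
                    {"pk": rel.pk, "fields": serialize_fields(rel, subspec)}
                    for rel in getattr(obj, relation)
                ]
        else:
            value = getattr(obj, entry)
            # methods such as total_price are called to get their value
            out[entry] = value() if callable(value) else value
    return out


order = Order(1, [
    Product(1, "Product #1", Decimal("9.98")),
    Product(13, "Subscription #1, Product #13", Decimal("14.98")),
])
result = serialize_fields(order, [{"products": ["price", "name"]}, "total_price"])
```

The output format (JSON, XML, ...) would be applied to result
afterwards, so the field selection stays independent of the format.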

A couple of notes.  I think that providing 'excludes' behavior could
be done by prefixing a field name with a minus sign, i.e.:

    fields = {'order': ['-order_placed']}

If the only fields listed are prefixed with a minus sign, we can
assume that all remaining fields are to be included.  Mixing prefixed
and normal fields assumes implicit negations for any fields not
mentioned:

    fields = {
        'order': ['-order_placed', 'total_price']
    }

serializes only the total_price pseudo-field, ignoring both
order_placed, explicitly, and products, implicitly.  This is just
personal preference -- the decision needs to be made, and an 'exclude'
member to provide explicit field exclusions is just as sensible.
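
The rule described above is small enough to sketch directly.
resolve_fields is a hypothetical helper, and the sketch assumes that
"all remaining fields" means the model's own fields (so pseudo-fields
such as total_price must be listed explicitly); that choice is open
to discussion.

```python
def resolve_fields(spec, model_fields):
    """Return the field names to serialize for a minus-prefixed spec."""
    excluded = [name[1:] for name in spec if name.startswith("-")]
    included = [name for name in spec if not name.startswith("-")]
    if included:
        # Explicit names present: anything not listed is implicitly dropped.
        return [name for name in included if name not in excluded]
    # Only exclusions given: keep every remaining model field.
    return [name for name in model_fields if name not in excluded]


ORDER_MODEL_FIELDS = ["products", "order_placed"]

only_excludes = resolve_fields(["-order_placed"], ORDER_MODEL_FIELDS)
mixed = resolve_fields(["-order_placed", "total_price"], ORDER_MODEL_FIELDS)
```

With only exclusions, just 'products' remains; in the mixed case,
only 'total_price' is kept.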

Providing a member to allow shortcut fields definition is a good
idea.  For example:

class OrderSerializer(serializers.Serializer):
    # similar to QuerySet.select_related('products')...
    select_related = ['products']
    # or, to mirror QuerySet.select_related(depth=1)...
    select_related = 1

This would use the default behavior, except the 'products' key in the
output would be a list of the serialized products, which are
serialized in the default way.

More complex output can be produced by calling one serializer from
another:

class OrderSerializer(serializers.Serializer):
    fields = {
        'order': [{'products__via': 'ProductListing'}, 'total_price']
    }

The 'via' keyword would cause the serializer to use the serialized
output of ProductListing when called with the particular order's
products.all().
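
A rough sketch of the delegation, using plain dictionaries instead of
models.  The registry, register() and the function names here are
hypothetical stand-ins for whatever serializers.register would do;
only the 'relation__via' key convention comes from the example above.

```python
REGISTRY = {}


def register(name, func):
    """Record a serializer callable under a name (hypothetical registry)."""
    REGISTRY[name] = func


def product_listing(product):
    """Stand-in for the product listing serializer."""
    return {
        "pk": product["pk"],
        "fields": {"name": product["name"], "price": product["price"]},
    }


def serialize_order(order, spec):
    out = {}
    for entry in spec:
        if isinstance(entry, dict):
            for key, serializer_name in entry.items():
                # 'products__via' -> relation 'products', keyword 'via'
                relation, _, keyword = key.partition("__")
                if keyword != "via":
                    raise ValueError("unsupported key: %r" % key)
                delegate = REGISTRY[serializer_name]
                out[relation] = [delegate(item) for item in order[relation]]
        else:
            out[entry] = order[entry]
    return out


register("ProductListing", product_listing)

order = {
    "pk": 1,
    "total_price": "24.96",
    "products": [
        {"pk": 1, "name": "Product #1", "price": "9.98"},
        {"pk": 13, "name": "Subscription #1, Product #13", "price": "14.98"},
    ],
}
result = serialize_order(order, [{"products__via": "ProductListing"}, "total_price"])
```

Looking serializers up by name keeps the fields declaration purely
declarative, at the cost of requiring registration before use.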

To be frank, deserialization is made much more complex with the last
few examples, and brings us back full circle to the data integrity
problems associated with deserializing relational database data from a
flat file.  But, in the context of fixtures, it's typical that the
database is empty, and so the deserializer can just dump all the data
in and maintain the relations (as is currently done).  In situations
[2] where the data is prepopulated, and pk values for relationships
may not be correct, perhaps a loaddata argument could be specified to
tell the deserializer to find the correct pk by searching for a row
that matches the serialized fields.  For ambiguous cases, it is
probably best to use human power instead of Django power.
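
The pk lookup could behave roughly like the sketch below, which
models the existing table as a dictionary of pk to row.
find_matching_pk is a hypothetical name, and the exact matching
policy (here: every serialized field must be equal) is an assumption.

```python
def find_matching_pk(existing_rows, fields):
    """Return the pk of the single row matching all fields, or None.

    Raises LookupError when more than one row matches, so that a human
    can resolve the ambiguity.
    """
    matches = [
        pk for pk, row in existing_rows.items()
        if all(row.get(name) == value for name, value in fields.items())
    ]
    if not matches:
        return None
    if len(matches) > 1:
        raise LookupError(
            "ambiguous: %d rows match %r" % (len(matches), fields)
        )
    return matches[0]


existing = {
    7: {"name": "Product #1", "price": "9.98"},
    8: {"name": "Widget", "price": "9.98"},
}
corrected_pk = find_matching_pk(existing, {"name": "Product #1", "price": "9.98"})
missing_pk = find_matching_pk(existing, {"name": "No such product"})
```

Matching on price alone would raise LookupError here, since both
rows share it; that is the case to hand back to the user.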


Additional issues (some solved above, coincidentally), including the
addition of arbitrary fields to serialized data [7], the json
serializer's handling of gettext_lazy [8][9], and fields to be ignored
by loaddata [10] will also be fixed.

NB: Though all of my examples have used JSON, my proposal is of a
system that varies independently of serialization format.  I believe
that it's important to draw a line between the format and the data to
be formatted, and Django already provides the proper facilities to
allow custom serialization formats.

I believe that my proposal can be implemented in the 13-week GSoC time
frame as follows.  I like to work time-boxed, so everything is split
as such.  I would rather trim a task or two, to be implemented later,
than have all tasks for a phase leak over into the next week.

Assume that every phase ends with documentation writing, regression
test creation/updates, and bug hunting.

Prior to 05/23 Discuss and build use cases to demonstrate goals, flesh
out API, last-minute project scoping
05/23 - 06/05 Code foundational serializers.Serializer base class,
where

class ProductSerializer(serializers.Serializer):
    pass

would act just as the current serializers.serialize(format,
Product.objects.all()), but with the new code structure.
06/06 - 06/12 Implement fields attribute (and excludes attribute, or
'-field' functionality) in cases where relations are not followed.
Includes child models, as in the ProductListing example class above.
06/13 - 06/26 Implement recursive serialization using the "__via"
syntax in fields.  Examine whether this format could be used
implicitly for deep serialization.
06/27 - 07/04 My extended family will be visiting, so any work this
week will reflect that listed for next week.
07/05 - 07/10 Bug hunting, documentation additions, finishing off
anything that didn't get finished in an earlier phase, regression
testing, write implementations of applicable use cases.
07/11 - 07/24 Provide implicit relation selecting via the fields
attribute.
07/25 - 07/31 Add agreed-upon flags to loaddata/dumpdata to allow
fixtures users to utilize new features.
08/01 - 08/10 Create documentation patches, bug hunt, improve code
quality [http://docs.djangoproject.com/en/dev/internals/contributing/#coding-style],
and begin community testing.
08/11 - 08/17 Communicate with community about unearthed bugs and
gotchas, and eliminate or prevent those issues.



References:
[1] => http://code.djangoproject.com/ticket/4656
[2] => http://code.djangoproject.com/ticket/7052
[3] => http://code.djangoproject.com/ticket/9422
[4] => http://code.djangoproject.com/ticket/10295
[5] => http://code.djangoproject.com/wiki/SummerOfCode2009#Ideas
[6] => http://code.google.com/p/wadofstuff/wiki/DjangoFullSerializers
[7] => http://code.djangoproject.com/ticket/5711
[8] => http://code.djangoproject.com/ticket/5590
[9] => http://docs.djangoproject.com/en/dev/topics/serialization/#id2
[10] => http://code.djangoproject.com/ticket/9279
