Re: GSoC Proposal: Serialization Enhancements

Russell Keith-Magee Wed, 01 Apr 2009 03:55:41 -0700

On Wed, Apr 1, 2009 at 2:10 PM, Russ <taal...@gmail.com> wrote:
>
> Questions for the time-pressed:
>
> * Have you ever needed, or can you conceive of ever wanting, to
> provide multiple formats (JSON/XML/etc) for the same data? In other
> words, is there a use case for easily producing different
> serializations of the same data?

Sure - a web API that honors the HTTP_ACCEPT header. Different
customers will want different output formats, depending on what they
are integrating with. Enterprisey types will tend to want XML; web2.0
types will want JSON. You want both as customers :-)
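
To make that concrete, here is a minimal sketch of picking a format from the Accept header. The names `SUPPORTED` and `pick_format` are made up for illustration, and the parsing is deliberately simplified (it ignores q-values and wildcards):

```python
# Hypothetical mapping from media types to serializer format names.
SUPPORTED = {
    "application/json": "json",
    "application/xml": "xml",
    "text/xml": "xml",
}


def pick_format(accept_header, default="json"):
    """Return the format for the first supported media type in the header.

    Simplified on purpose: q-values and wildcards are ignored.
    """
    for part in accept_header.split(","):
        # Drop any parameters such as ";q=0.5" or ";charset=utf-8".
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return default
```

In a view, the result could then be passed as the first argument to serializers.serialize(), with the header read from request.META.get('HTTP_ACCEPT', '').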

> * If you could serialize data in whatever structure you wanted, would
> you still need to deserialize it at some point, or is this type of use
> more unidirectional?

I suspect that any given serialization scheme is unidirectional in
practice. However, that doesn't mean you won't have a use for a
deserialization scheme that can produce Django objects from some
external data source. It just won't necessarily be the same format
that you output.

> On Mar 31, 7:33 am, Russell Keith-Magee <freakboy3...@gmail.com>
> wrote:
>> On Tue, Mar 31, 2009 at 11:43 AM, Russ Amos <taal...@gmail.com> wrote:
>>
>> > Would writing an appropriate template, while certainly not ideal, provide
>> > most of the functionality for the common use case being discussed?
>> > [snipped]
...
> The docs feature instructions on producing CSV in this way [1], as an
> example.  Obviously, this is not ideal, but there's also something to
> be said for the flexibility, even if part of that is reinventing
> wheels.  This was more a rambling brainstorm than a useful part of my
> proposal...
>
> [1] => 
> http://docs.djangoproject.com/en/dev/howto/outputting-csv/#using-the-template-system

I'd leave out the ramble from your proposal then. Using templates for
serialization is one of those things that is possible, but not really
desirable. The base JSON/XML/YAML libraries provide all sorts of
facilities (such as special character escaping) and guarantees (e.g.,
syntax correctness) that would be painful to implement or guarantee in
templates.
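
A small illustration of the escaping problem, using only the standard library. The sample string and the `is_valid_json` helper are invented for the example; the "template" is just string interpolation standing in for a real template:

```python
import json

# A value containing quotes, a newline and a backslash.
name = 'The "Deluxe" Widget\nNow with backslashes \\'

# Template-style interpolation: nothing is escaped, so the result is not
# valid JSON.
naive = '{"name": "%s"}' % name

# The json library escapes the awkward characters for us.
safe = json.dumps({"name": name})


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True
```

Here is_valid_json(naive) is False, while safe parses and round-trips to the original string.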

>> Some immediate concerns/questions:
>>
>>  * How do you deal with objects of different type? At present, you can
>> pass a disparate list of objects to the serializer. The only
>> requirement is that every element in the list is a Django object - it
>> doesn't need to be a homogeneous list.
>
> Initial thoughts are "throw an error if the attribute is missing", but
> I need time to consider a generic (read: useful) solution.

I'm not sure we're on the same page here. I'm talking about:

>>> serializers.serialize('json', list(Product.objects.all()) +
...     list(Company.objects.all()))

This is legal with the existing serializers, but the serializers you
have specified to date all seem to focus on homogeneous lists (i.e.,
the list only contains Products). How do you propose to handle
serialization of different object types? Ignoring attributes that
aren't explicitly listed sounds like a bit of a risky strategy - and
is potentially problematic - what if you want to display the name of a
product, but not the name of a Company?
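
One possible shape for an answer, purely as a sketch: key the rendering rules on the type of each object. Product and Company below are plain stand-in classes rather than Django models, and `FIELDS_BY_TYPE` and `serialize_mixed` are hypothetical names, not part of any existing API:

```python
import json


class Product:
    def __init__(self, name, price):
        self.name, self.price = name, price


class Company:
    def __init__(self, name, country):
        self.name, self.country = name, country


# Hypothetical registry: which attributes to emit for each type.
FIELDS_BY_TYPE = {
    Product: ["name", "price"],
    Company: ["country"],  # deliberately omits "name"
}


def serialize_mixed(objects):
    """Serialize a heterogeneous list using per-type field lists."""
    rows = []
    for obj in objects:
        try:
            fields = FIELDS_BY_TYPE[type(obj)]
        except KeyError:
            # Fail loudly rather than silently skipping unknown types.
            raise TypeError("no rules for %s" % type(obj).__name__)
        rows.append({
            "model": type(obj).__name__.lower(),
            "fields": {f: getattr(obj, f) for f in fields},
        })
    return json.dumps(rows, sort_keys=True)
```

With this, a Product is rendered with its name and a Company without one, and an object of an unregistered type raises an error instead of being half-serialized.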

>>  * How does this translate to non-JSON serializers? The transition to
>> YAML shouldn't be too hard, but what about XML? How does `structure`
>> get interpreted by the XML serializer? How do you differentiate
>> between the element name, element attributes, and child nodes that can
>> be used in XML serialization?
>
> This is what stands out to me most, now.  I realized after climbing
> into bed last night that I didn't even _consider_ XML, having
> previously written it off (in my original proposal) as format (and
> therefore irrelevant) since the focus was different.  Obviously XML is
> very different from JSON, and I am no longer sure that we can allow
> completely arbitrary serialization structure (which is the goal) AND
> maintain independence between structure and format, which I would like
> to do if at all possible.  I'm not sure if there's a realistic use
> case for being able to easily use one structure and multiple formats,
> however.  Boiling it down to the least common denominator seems
> limiting, but allowing complete flexibility could be quite coupling.

As stated above, there is at least one notable use case - web APIs.

Also - I see this more as targeting a greatest common factor, rather
than a lowest common denominator. XML is pretty much the most verbose
case - if you can cover XML, I'm confident you can hit JSON/YAML etc
by ignoring various features that the serializer has specified.
Consider the serialization of an individual field:

JSON: 'name': 'Fred'
XML: <value field='name' datatype='String'>Fred</value>

In order to render XML, you need to specify a name for each element
(which could be derived from the field itself). You also need to be
able to specify a dictionary of attributes for the tag (which could
also be derived from the field).

As long as the API provides a way to specify the element name and the
attribute dictionary (or a way to derive these values from a field),
you can render XML. The same information can be used to render the
JSON - it just gets ignored. The default serializer will obviously
provide values that reflect the default XML serializer; users of the
JSON serializer can safely ignore these defaults until they have a use
for XML serialization.
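
As a rough sketch of that idea, assuming a per-field description with an element name, an attribute dictionary and a value (the dictionary layout and both function names are invented here):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical description of one field.
field = {
    "element": "value",
    "attrs": {"field": "name", "datatype": "String"},
    "text": "Fred",
}


def render_xml(field):
    """XML uses everything: element name, attributes and text."""
    node = ET.Element(field["element"], field["attrs"])
    node.text = field["text"]
    return ET.tostring(node, encoding="unicode")


def render_json(field):
    """JSON only needs the field name and the value; the element name
    and the datatype attribute are simply ignored."""
    return json.dumps({field["attrs"]["field"]: field["text"]})
```

The same description feeds both renderers; the JSON one just reads less of it.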

> The larger question, I suppose, is do we really want to be subclassing
> for structure and subclassing for format, or subclassing for structure
> and format?  The former provides a certain level of an "I wrote
> decoupled code" feeling, but, again, I can't find a use case for
> this.  The latter feels restrictive if this use case ever does
> appear.  There's also something to be said for API uniformity...  Can
> a useful level of independence be achieved when the end formats are so
> different?

Yes, the end formats are very different - but on the other hand, they
are communicating the same information, in fundamentally the same way
(serialized strings, wrapped with annotating fruit). My aim is to
ensure that the common parts are factored out as common, while leaving
the ability to specify the distinct stuff as required.

Even if, in practice, you end up having different serialization
instructions for each format, the type of instructions that you need
to give are fundamentally the same. The easiest example of this that I
can foresee is that JSON uses fields:{} to wrap around the field list,
whereas XML uses <values> around each element, but no wrapper around
fields themselves. This may mean two sets of slightly different
rendering instructions, but the type of instructions that are being
given (do I need to wrap the list of fields, and if so, what do I call
that structure) are common.

However, even in this situation - a single set of rendering
instructions should be sufficient to generate complete, correct and
consistent parallel JSON, XML, YAML, etc. renderings. The only reason
for you to need multiple sets of rendering instructions is if you
actually have output formats in mind, and you can't accommodate your
desired outputs with a single set of instructions.
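
A sketch of what a single set of instructions driving two formats might look like. The `INSTRUCTIONS` keys and the two functions are assumptions made up for this example, and records are plain dictionaries rather than model instances:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rendering instructions shared by every format.
INSTRUCTIONS = {
    "root": "objects",           # top-level structure (XML)
    "object": "object",          # wrapper for each item (XML)
    "fields_wrapper": "fields",  # structure around the field list (JSON)
    "field": "value",            # element used for each field (XML)
}


def to_json(records, ins):
    """JSON wraps the field list; it ignores the XML-only names."""
    return json.dumps([{ins["fields_wrapper"]: dict(r)} for r in records])


def to_xml(records, ins):
    """XML wraps each field in its own element, with no list wrapper."""
    root = ET.Element(ins["root"])
    for record in records:
        obj = ET.SubElement(root, ins["object"])
        for name, value in record.items():
            node = ET.SubElement(obj, ins["field"], {"name": name})
            node.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Each renderer picks out the instructions it understands and skips the rest, so one definition yields parallel output in both formats.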

>> > Some "helpers" I think might be useful would be hooks for the various types
>> > of fields, including but not limited to relations, to allow things like
>> > special text processing or dependency traversal, and providing the current
>> > default "structure" in case the user simply wants to do some pre-processing
>> > of some form.
>>
>> I appreciate that this is one of those details that we will need to
>> finesse with time, but it would be interesting to hear your
>> preliminary thoughts on this - in particular, on how you plan to link
>> the string in the 'template' to the helper.
>
>
> Conversations about format complications notwithstanding, the actual
> serialization process I see as iterating through the structure
> attribute, converting keys to unicode, and processing the values as
> follows (loosely):
> - If the value is a list, and the key happens to be a relation field,
> loop through everything in the list with each of the objects in the
> relation.  There's a bit of a magic feel to this I don't like, so I've
> got an alteration to make below [3].
> - If the value is a string, follow conventions -- check if it's a
> field of the model, check if it's a method of the model, check if it's
> in the form "relation__field" (and "relation__relation__field" etc),
> check if it's a method of the serializer, and just default to "it must
> be just a string" in the end (although, might this be confusing for
> debugging?). Evaluate whatever it ends up being until it, too, is a
> string.
> - Tack on the value to the string produced, thus far, formatting as
> appropriate.
>
> [3] =V
> class ProductSerializer(serializers.Serializer):
>    structure = {
>        "name": "name",
>        "price": "price",
>        "description": "truncate_description"
>    }
>
>    def truncate_description(self, product):
>        return product.description[:40]
>
> class OrderSerializer(serializers.Serializer):
>    structure = {
>        "order_id": "pk",
>        "products": "products_list",
>        "total": "total_price"
>    }
>
>    def products_list(self, order):
>        products = order.products.all()
>        return [serializers.serialize(self._format, product,
> serializer=ProductSerializer) for product in products]
>
> I think this is a bit more realistic a use, eliminating the magical
> treatment of list elements, but isn't as ridiculously simple to
> write.  Now, you have to want it.  Thoughts?

It's a start. You may also want to think about how callables could be
used here in place of an attribute lookup strategy.
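
For instance, something along these lines, where a structure value is either an attribute name or a callable taking the object. This is only a sketch; `resolve` and `render` are hypothetical helpers and Product is a plain stand-in class:

```python
class Product:
    def __init__(self, name, description):
        self.name, self.description = name, description


def resolve(value, obj):
    """Resolve one structure entry against obj.

    Callables are called with the object; anything else is treated as
    an attribute name.
    """
    if callable(value):
        return value(obj)
    return getattr(obj, value)


# Strings name attributes; callables compute a value directly.
structure = {
    "name": "name",
    "description": lambda product: product.description[:40],
}


def render(obj, structure):
    return {key: resolve(value, obj) for key, value in structure.items()}
```

A callable removes the guesswork about whether a string is a field, a method on the serializer, or a literal.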

>> However, here's my brain dump, such as it is:
>
> I feel I should take a moment to thank you for taking many moments on
> critiquing my proposal and providing your insightful brain dumps, so I
> shall: thanks!

You're most welcome.

>> My initial thought was that the serializers would end up being a lot
>> like the Feeds framework - a base class with lots of
>> methods/attributes that can be overridden to provide specific
>> rendering behaviour. If you tear down the serialization problem, you
>> end up with a set of relatively simple questions:
>
> I've regrouped your observations so my observations make sense.
>
>>  * What is the top level structure (e.g., the outer [] in JSON, the
>> XML header and root tag)?
>>
>>  * What is the wrapping structure for each element in the list of
>> objects (e.g., the {} in JSON, the <object> tag in XML)
>>
>>  * How is that list of fields presented to the user? (fields:{} in
>> JSON, child elements in XML)
>
> The answers to these hinge on how flexible the custom serializers
> should be.  If we're okay with insisting on a little bit of basic
> format, we can allow the end-user more freedom with the structure.
> However, to provide real flexibility in, say, the additional aspects
> of XML serialization, I think we might have to force users to pick a
> serialization format, perhaps with the mention that changing between
> some formats is easier to do (JSON <-> YAML) than between others (JSON
> -> XML).

Again - just because a serializer has field_node_name='value', and the
XML serializer uses this to create <value></value> tags, it doesn't
mean the JSON serializer has to use that detail. It also doesn't mean
that the default XML instructions are the same as the default JSON
instructions.
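
Roughly what I have in mind, as a sketch only. The `Instructions` class, its `ShortTags` subclass and the two functions are hypothetical; field_node_name is the only name taken from the discussion above:

```python
import json
import xml.etree.ElementTree as ET


class Instructions:
    """Hypothetical default rendering instructions."""
    field_node_name = "value"


class ShortTags(Instructions):
    """A user override that only matters to the XML renderer."""
    field_node_name = "f"


def xml_fields(fields, instructions):
    parent = ET.Element("object")
    for name, value in fields.items():
        node = ET.SubElement(parent, instructions.field_node_name,
                             {"name": name})
        node.text = str(value)
    return ET.tostring(parent, encoding="unicode")


def json_fields(fields, instructions):
    # Accepts the same instructions but has no use for field_node_name.
    return json.dumps(fields, sort_keys=True)
```

Overriding field_node_name changes the XML tags and leaves the JSON output untouched.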

>>  * How is each field rendered? (key-value string pairs? <value>
>> nodes?) If the field is itself a serializable object (e.g., another
>> Django object) how is it serialized?
>>
>>  * What descriptive attributes exist for each element in the list?
>> (pk, model name)
>>
>>  * How/where are these descriptive attributes rendered? ( dict
>> entries? root node attributes? child nodes?)
>>
>>  * Which fields (including extra fields, model properties, computed
>> fields, etc) should be included in the list of fields?
>>
>>  * Is there any optional metadata for each data field, such as
>> datatype? How is that optional metadata interpreted?
>
> I think these are all answered by the structures I've suggested, and
> the existing serializers do a decent job of this, already.  If the end-
> user wants to include metadata, he/she is welcome to do so.  The same
> can be said of extra fields, which fields, and how to format fields.
> If tweaking of a field is necessary, wrap the field in a method of the
> serializer:
>
> class MySerializer(serializers.Serializer):
>    structure = {
>        "field_name": "my_method"
>    }
>
>    def my_method(self, object):
>        return object.field_name + u"!"

I think you're on the right track here, specifics and details notwithstanding.

>> I was also thinking that you aren't necessarily going to be
>> subclassing the serializer itself. The answers to these questions are
>> really just rendering instructions that can be followed by any
>> serializer, once some common ground rules are established. The
>> existing serialization engine has a hard-coded set of answers; what we
>> need to do is refactor those answers out into a default definition
>> that can be subclassed, overridden, or rewritten to suit specific
>> needs.
>
> Yes, and on that point I want to again emphasize that I think there's
> something to be said for the difference in format and structure.  If
> the two can be kept separate, I would like to use a different name for
> the base instruction class than "Serializer", which I've been using to
> avoid bikeshed discussion on the name.  That said, I do think a
> different name for this would be nice.  serializers.Structure?
> serializers.Renderer?

This is a bikeshed. As long as you don't choose bright pink, if you
paint the bikeshed, you get to pick the color.

Yours,
Russ Magee %-)

