Proposal: allow optional user-defined functions to be passed into serializers.deserialize()

Wyley Fri, 25 Jan 2008 10:30:36 -0800

Hi everyone,

Below is an idea I've got for how to make the deserialization
interface a bit more flexible, since I and other users have expressed
a need for it.  I am willing to work on implementing this idea
(indeed, I will probably need to, even if it's just for my own
project), but I wanted to get some feedback from django-developers
first.  This is my first post to this list, so please bear with me.


First, some background, from an off-list exchange I had with Russell
Keith-Magee (thanks, Russell, for your reply):

==================================================
> My question is as follows:
>
> Ticket #3390 and the corresponding changeset (4610), which you
> submitted, address the issue of forward references when
> deserializing data.  This means -- stop me if I'm wrong -- that if I
> have an Author object with a ForeignKey to Title, I can first
> deserialize the Author object, followed by the Title object, and so
> long as everything is wrapped in a transaction there will be no
> problems with the ForeignKey reference during the process.

That's correct - with the noted exception that it doesn't work on MySQL
with InnoDB tables. InnoDB checks referential integrity immediately,
on each statement, rather than at commit, even if you are within a
transaction. MyISAM tables work fine - but only because they don't
have referential integrity :-)

> I'm looking to take this behavior one step further, because I want to
> be able to export data from one database/Django installation and
> import it into another.  Namely, I want to assume that deserialized
> data is internally consistent, but that the primary keys have no
> meaning in the database I am importing it into.  In the Author/Title
> example, this means that after both objects are deserialized, I want
> to:
> 1.  optionally look to see if there is an existing Title record that
> matches the deserialized Title (based on non-pk criteria)
> 2.  if so, change the pk of the deserialized Title to match the
> existing record; if not, save the deserialized Title as a *new* record
> 3.  update the ForeignKey field on the deserialized Author with the
> new value for the deserialized Title's pk.
>
> Do you know if there are any plans to implement this kind of function
> in Django (or if it already exists and I've somehow missed it)?  If
> not, I'm looking at needing to code this myself; do you think this is
> a sufficiently valuable and general function that I should offer to
> contribute it to Django, and possibly get some help with it from more
> experienced developers?

It's not on my immediate to-do list, but ideas similar to this one
have been raised previously.

The slightly simpler use case is to import a fixture as a collection
of completely new objects, rather than overwriting existing objects.
At present, if your database contains a User with id 5 and your
fixture provides a user with id 5, the fixture will overwrite the
existing user 5. A common suggestion is to add a tag/flag/option to
allow me to import my fixture and have the fixture get added as user
ID 6 (or the next available id), avoiding the overwrite.

If I'm understanding your suggestion, you want something quite
similar, except that it would compare fixture data looking for an
existing match before creating a new record. I can't say I have much
of a use for this myself at present, but I can see the value it would
offer. If a solid, tested, documented implementation were to land in
my lap it would probably find itself in trunk.

However, the devil is in the detail. The 'load fixture as new data'
idea has been discussed a few times in the past on django-dev - while
the concept is relatively simple, coming up with a good implementation
is not. Your proposal adds additional complexity - effectively, how to
specify a query in a fixture.
=============================================

So, with this discussion in mind, here's my idea:  allow an optional
user-defined function (call it "map") to be passed to the
deserialize() function.  This function, if passed, will be called with
each DeserializedObject as its argument after the object is
constructed by a deserializer but before it is returned by that
Deserializer's next() method -- so probably within the body of the
next() method itself, or as a decorator for it, though where exactly
the call is made is a point for further discussion.  (The
inspiration here is to give users something like a Ruby "block" for
the deserializer.)

Thus, one could make a call like this:
from django.core import serializers

def some_function(ds_obj):
    # An admittedly silly example...
    ds_obj.message = "Hello!  I touched this."
    # Always set the flag so the loop below can rely on it.
    ds_obj.some_flag = bool(some_condition(ds_obj))
    return ds_obj

# "map" is the proposed new keyword argument; it does not exist yet.
ds_objs = serializers.deserialize('xml', f.read(), map=some_function)

for ds_obj in ds_objs:
    # Again, admittedly silly... but there's clearly much more that
    # could be done.
    log_message(ds_obj.message)
    if ds_obj.some_flag:
        do_something()
    else:
        do_something_else()


This approach has several advantages:

1.  It allows the user to insert arbitrary logic into the
deserialization process, which would add a lot of flexibility for
those who want it.  The passed function could be as simple as an
always_new() function which indiscriminately sets primary keys to
None, or as complex as a function that looks up existing records and
sets primary keys of deserialized data, flags, or whatever else
accordingly.

2.  It doesn't require a change to the serialization interface, or to
the way serializers output data.  In general, I don't think it's a
good idea to stick extra metadata into the output of
serializers.serialize().  A function for serializing/exporting data
cannot be expected to anticipate the needs of a foreign database; all
it can do is make sure the serialized data is internally consistent.
The process of deserializing/importing foreign data is the place to
make decisions about whether to save a record as new or overwrite an
existing one.

3.  It also doesn't require a change in the way users deserialize
data.  If you don't pass a map function, deserialize() just operates
as it always would.

4.  It shouldn't take much effort to implement.  In the simplest case,
it would just mean unpacking the "map" keyword argument in
serializers.base.Deserializer.__init__, and then adding an "if
callable(self.map): ..." construct in each deserializer's next()
method.
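
To make points 1 and 4 a bit more concrete, here is a rough sketch of
how the hook might look.  It deliberately does not use Django:
StubDeserializedObject and SketchDeserializer are hypothetical
stand-ins for DeserializedObject and a concrete Deserializer, the
records are plain tuples, and match_by_name() uses a dict in place of
a real database lookup.  It also uses the Python 3 spelling __next__
where the text above says next().  Treat it as an illustration of the
idea under those assumptions, not as the actual implementation.

```python
class StubDeserializedObject:
    """Hypothetical stand-in for a DeserializedObject; holds a pk and fields."""

    def __init__(self, pk, fields):
        self.pk = pk
        self.fields = fields


class SketchDeserializer:
    """Illustrative deserializer; not Django's actual implementation."""

    def __init__(self, records, **options):
        # Unpack the optional "map" keyword argument (point 4 above).
        self.map = options.pop("map", None)
        self._records = iter(records)

    def __iter__(self):
        return self

    def __next__(self):
        pk, fields = next(self._records)
        ds_obj = StubDeserializedObject(pk, dict(fields))
        # Hand the object to the user-defined function before returning it.
        if callable(self.map):
            ds_obj = self.map(ds_obj)
        return ds_obj


def always_new(ds_obj):
    """Clear the primary key so the object would be saved as a new record."""
    ds_obj.pk = None
    return ds_obj


def match_by_name(existing_pks):
    """Build a map function that reuses the pk of a record with the same name.

    existing_pks is a plain dict standing in for a database lookup.
    """
    def mapper(ds_obj):
        # No match yields None, i.e. "save as a new record".
        ds_obj.pk = existing_pks.get(ds_obj.fields["name"])
        return ds_obj
    return mapper


records = [(5, {"name": "alice"}), (6, {"name": "bob"})]
as_new = list(SketchDeserializer(records, map=always_new))
matched = list(SketchDeserializer(records, map=match_by_name({"bob": 42})))
unchanged = list(SketchDeserializer(records))
```

Without a map argument the objects come back with their original
primary keys, which is the "nothing changes for existing users"
behaviour described in point 3.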

Of course, depending on where exactly the call to the user-defined map
function is made, it may be the case that this approach doesn't allow
the user to do anything that wasn't already possible with a for-loop
wrapped around the deserialize() function.  The main response I have
is that the benefit in this case is an organizational one:  it allows
the user to separate logic for dealing with the preparation and
importing of foreign data (i.e., the logic for *creating*
DeserializedObjects) from the logic for dealing with locally-formatted
but unsaved data (i.e., the logic for *saving* DeserializedObjects).
With the current deserialize() function, the user must do any extra
work associated with the preparation of foreign data after the
DeserializedObjects have already been created, which is especially
problematic if you're trying to import data that doesn't entirely
agree with your locally-defined models.
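
For comparison, the for-loop alternative mentioned above could be
written today as a small generator wrapped around whatever
deserialize() returns.  In this sketch apply_to_each() and
tag_as_foreign() are hypothetical helpers, and plain dicts stand in
for DeserializedObjects so that the snippet runs without Django.

```python
def apply_to_each(ds_objs, func):
    """Lazily apply func to each object, mimicking the proposed map argument."""
    for ds_obj in ds_objs:
        yield func(ds_obj)


def tag_as_foreign(ds_obj):
    # Hypothetical preparation step: mark the record as imported.
    ds_obj["foreign"] = True
    return ds_obj


# Plain dicts stand in for the objects that deserialize() would yield.
stream = iter([{"pk": 1}, {"pk": 2}])
prepared = list(apply_to_each(stream, tag_as_foreign))
```

The result is the same as with a map argument; the difference is only
that the preparation step lives outside the deserializer rather than
inside it.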

Anyway, this proposal obviously needs some further discussion, but
that's why I'm posting it here.  Thoughts?

Thanks!
Richard Lawrence



