[GSoC] Proposal for discussion about Serialization requirements and requesting for Review

Madhusudan C.S Thu, 26 Mar 2009 09:48:46 -0700

Hi all,
    After some discussion with Malcolm on this list, and some research
based on the pointers he gave me, I have come up with a rough plan of
what I want to do this summer for Django. Since we are running out of
time, I have written a *rough draft* of the proposal without a full
discussion with the Django community about the features that can be
implemented. So this is in no way a *complete proposal*, and I don't
want to submit it until some discussion has happened. The required
proposal format also asks for links to the devel list discussions that
led to the proposal, and I have none except Malcolm's mails. So I kindly
request you all to review my proposal thoroughly and tell me what I
should add or remove, whether my propositions and assumptions are
correct, and how I can fix them where they are not, so that I can submit
the proposal to Google.


*Note:*
  Django doesn't serialize inherited Model fields in the Child Model. I
asked on IRC why this decision was taken but got no response. I searched
the devel list too, but did not find anything on it. I want to add this
to my proposal, but first I would like to know why the decision was
taken. Would it be workable and necessary to add it to my proposal?
The same goes for Ticket #10201. Can someone please tell me why
microsecond data was dropped?

  Also, I am leaving out the extras option for serializers, since a
patch for it has already been submitted (Ticket #5711) and looks like a
working solution. If more needs to be done there before it can be
committed to Django trunk, please tell me; I will work on that and add
it to the proposal.

Here is my long long long proposal:

Title: Restructuring of the existing Serialization format and improvement
of APIs

~~~~~~~~~
Abstract
~~~~~~~~~

Greetings!

   I wish to give Django better support for Serialization by building
upon the existing Serialization framework. This project extends the
format of the output that the existing Serializer produces by allowing
in-depth traversal of Relation Fields in a given Model. It also extends
the existing API to specify the depth of the relations to be serialized
and the names of the related models to be serialized. The API also
provides backwards compatibility, so that older versions of serialized
output keep working with the changes to be introduced. All the changes
will be made keeping in mind 2 important things.
   1. All the changes should be backwards compatible (this can only be
     broken when a very important requirement that greatly improves
     serialization cannot be implemented without backwards incompatible
     changes, and the Django community gives the green signal for
     doing so).
   2. The serialized data should be useful not just within Django apps
     but also for exporting the data for external use and processing.

~~~~~~~
Why?
~~~~~~~

- The existing format of the serialized output doesn't specify the name
  of the Primary Key (PK henceforth), which is a problem for fields
  which are implicitly set as PKs (Ticket #10295).
- The existing format only specifies the PK of the related field, but
  doesn't traverse it in depth to specify its fields (Ticket #4656).
- There are no APIs for the above requirements.
- The fields of inherited models are not serialized.

Situations/problems arising from attempting to fix the above problems:
- When we allow Serialization to follow relations, it becomes unnatural
  if the related Model is included in the data of every relating model.
  The data becomes extremely redundant. Consider the following example.

  from django.db import models


  class Poll2(models.Model):
      question = models.CharField(max_length=200)
      pub_date = models.DateTimeField('date published')

      def __unicode__(self):
          return self.question


  class Choice2(models.Model):
      # Each choice belongs to exactly one poll.
      poll = models.ForeignKey(Poll2)
      choice = models.CharField(max_length=200)
      votes = models.IntegerField()

      def __unicode__(self):
          return self.choice

  Serializing the Choice2 Model might give something like the output
below if we allow following of relations:
[
    {
        "pk": 1,
        "model": "testapp.choice2",
        "fields": {
            "votes": 1,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Django"
        }
    },
    {
        "pk": 2,
        "model": "testapp.choice2",
        "fields": {
            "votes": 2,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Python"
        }
    },
    {
        "pk": 3,
        "model": "testapp.choice2",
        "fields": {
            "votes": 4,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Others are useless"
        }
    }
]
  which clearly shows the redundant Poll2 data. Here we are serializing
  Choice2, of course, but that doesn't mean serializing Poll2 gives the
  natural serialized output either. In fact, serializing Poll2 gives
  nothing pertaining to the Choice2 Model instances. A more natural
  serialization should result from serializing a Poll2 Model instance
  which includes within itself all the Choice2 Model instances that are
  related to it. This is an obvious consequence of how Database schemas
  are designed by applying Normalization rules.

- The way loaddata and dumpdata are handled changes. The new versions
  of loaddata and dumpdata may not be compatible with fixtures generated
  by older versions.

Most of the above problems have been addressed in the tickets specified,
but the patches need to be dealt with more thoroughly after discussion
with the Django community in general. So design decisions need to be
taken for fixing most of the tickets (which I will do in the community
bonding phase).

~~~~~~~
How?
~~~~~~~

  The project begins with implementing a version-id field in the
serialized output. This field is provided for backwards compatibility.
It then proceeds by converting the existing PK field, which appears as
{
    "pk": 1,
    "model": "testapp.choice2",
    #...
to serialize the name of the PK field. I propose it to be presented as:
{
    "pk": {
        "id": 1
    },
    "model": "testapp.choice2",
    #...
  This change is proposed on the assumption that David Cramer's patches
for Ticket #373 will get into Django trunk sooner or later, since it is
a long-standing requirement. This representation allows multiple PK
fields to exist in the model and be serialized correctly.
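
As a rough illustration of the proposed PK change, the sketch below
converts a record between the two shapes using plain Python
dictionaries. The helper names upgrade_record and downgrade_record and
the default PK name "id" are assumptions made for this example; they are
not part of Django or of the proposal itself.

```python
def upgrade_record(record, pk_name="id"):
    """Return a copy of an old-style record with the PK keyed by name.

    pk_name is an assumption: the old format does not record the name of
    the PK field, so the caller has to supply it.
    """
    upgraded = dict(record)
    if not isinstance(record["pk"], dict):
        upgraded["pk"] = {pk_name: record["pk"]}
    return upgraded


def downgrade_record(record):
    """Return a copy of a new-style record with a bare PK value."""
    downgraded = dict(record)
    pk = record["pk"]
    if isinstance(pk, dict):
        if len(pk) != 1:
            # The old format has room for a single PK value only.
            raise ValueError("old format cannot hold a multi-field PK")
        downgraded["pk"] = next(iter(pk.values()))
    return downgraded


old = {"pk": 1, "model": "testapp.choice2", "fields": {"votes": 1}}
new = upgrade_record(old)
print(new["pk"])
```

A record with several PK fields has no old-style equivalent, which is
why the downgrade direction raises an error instead of guessing.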

  The corresponding changes in the deserializers to process this data
will also be made at this stage. The implementation touches the
following parts of Django:
django.core.serializers.python.Serializer.end_object()
django.core.serializers.xml_serializer.Serializer.start_serialization()
[It already implements version.]
and related methods and files.

  The project proceeds by splitting the serializer into 2 versions, one
handling the older version and one handling the current version of the
serialized output. Which version of the serializer to use is decided by
an "old_version=True" parameter added to the serialize method. The
deserialize method can however decide this by looking at the new
version-id. The django-admin.py loaddata and dumpdata commands will also
get an --old_version option.
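
A minimal sketch of how a deserializer might pick its code path from the
version-id follows. The proposal does not say where the version-id is
stored, so the per-record "version" key, the version numbers 1 and 2,
and every function name here are assumptions made only for
illustration.

```python
def pk_from_v1(record):
    # Old format: a bare PK value. The field name is not recorded, so
    # this sketch assumes the implicit name "id".
    return {"id": record["pk"]}


def pk_from_v2(record):
    # Proposed format: the PK is already a {field name: value} mapping.
    return dict(record["pk"])


# Hypothetical registry mapping a version-id to its PK reader.
PK_READERS = {1: pk_from_v1, 2: pk_from_v2}


def read_pk(record):
    """Dispatch on the version-id; records without one count as old."""
    version = record.get("version", 1)
    try:
        reader = PK_READERS[version]
    except KeyError:
        raise ValueError("unsupported serialization version: %r" % (version,))
    return reader(record)


print(read_pk({"pk": 1, "model": "testapp.choice2", "fields": {}}))
print(read_pk({"version": 2, "pk": {"id": 1}, "model": "testapp.choice2",
               "fields": {}}))
```

Treating a missing version-id as the old format is what lets existing
fixtures load without an explicit flag.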

  The second phase, the biggest one, starts by implementing in-depth
serialization of relations. The APIs will be implemented hand-in-hand
with the features. A "relations=(rel1, rel2, ...)" parameter to
serialize will specify which relations to serialize. A
"relation_depth=(N1, N2, ...)" parameter will also be provided to
serialize the related models recursively up to the specified depth N.
Skipping "relations=" means serializing all the related models in a
given model, and skipping "relation_depth=" means serializing to full
depth. Skipping both serializes just the PK of the related models (old
style). Further selection of the fields to be serialized in the
individual related models is provided with a DjangoFullSerializers-like
syntax, using dictionaries. An exclude fields option will be given,
similar to DjangoFullSerializers.
Link to DjangoFullSerializers:
http://code.google.com/p/wadofstuff/wiki/DjangoFullSerializers
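
The interplay of "relations=" and "relation_depth=" can be sketched
without Django by using a dictionary as a stand-in database. Everything
below is an assumption made for illustration: the DB layout, the
hypothetical testapp.author model, and the simplification of
relation_depth to a single integer instead of one depth per relation.

```python
# Stand-in "database": (model label, pk) -> field values.
# A foreign key is stored as a (model label, pk) tuple.
DB = {
    ("testapp.author", 7): {"name": "Alice"},
    ("testapp.poll2", 1): {
        "question": "What's Up?",
        "author": ("testapp.author", 7),
    },
    ("testapp.choice2", 1): {
        "poll": ("testapp.poll2", 1),
        "choice": "Django",
        "votes": 1,
    },
}


def serialize(model, pk, relations=None, relation_depth=None):
    """Serialize one record, following foreign keys where requested."""
    # Skipping both parameters keeps the old style: related PKs only.
    follow = relations is not None or relation_depth is not None
    fields = {}
    for name, value in DB[(model, pk)].items():
        if not isinstance(value, tuple):
            fields[name] = value  # ordinary field
            continue
        wanted = relations is None or name in relations
        depth_left = relation_depth is None or relation_depth > 0
        if follow and wanted and depth_left:
            next_depth = None if relation_depth is None else relation_depth - 1
            # Wrapped in a list to mirror the output shape shown earlier.
            fields[name] = [serialize(value[0], value[1], relations, next_depth)]
        else:
            fields[name] = value[1]  # just the PK of the related record
    return {"pk": pk, "model": model, "fields": fields}


print(serialize("testapp.choice2", 1))
print(serialize("testapp.choice2", 1, relation_depth=1))
```

With full depth (relation_depth skipped) this sketch would recurse
forever on circular relations, so a real implementation would need to
track the records it has already visited.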

  This phase proceeds by providing an optional API parameter,
"reverse_relation=[rel1, rel2]", used on the Related Model (Poll2 in the
example) rather than on the Model that relates to it (Choice2). This
does a reverse relation lookup, and for each Related Model instance it
serializes all the reverse relations that relate to that instance, which
solves the data redundancy problem described above. The output looks
something like below if serialized as:
serializers.serialize('json', Poll2.objects.all(),
reverse_relation=('choice2',))
[
    {
        "pk": 1,
        "model": "testapp.poll2",
        "fields": {
            "question": "What's Up?",
            "pub_date": "2009-03-01 06:00:00"
        },
        "testapp.choice2": [
            {
                "pk": 2,
                "model": "testapp.choice2",
                "fields": {
                    "votes": 2,
                    "choice": "Python"
                }
            },
            {
                "pk": 3,
                "model": "testapp.choice2",
                "fields": {
                    "votes": 4,
                    "choice": "Others are useless"
                }
            }
        ]
    }
]
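
The grouping behind such output can be sketched on plain dictionaries.
The function name nest_reverse_relations and its fk_field argument are
hypothetical; a real implementation would read the relation from the
model metadata instead of taking the field name as a parameter.

```python
from collections import defaultdict


def nest_reverse_relations(parents, children, fk_field):
    """Attach each child record to its parent under the child's label."""
    by_parent = defaultdict(list)
    for child in children:
        fields = dict(child["fields"])
        # The FK is implied by the nesting, so it is dropped here.
        parent_pk = fields.pop(fk_field)
        by_parent[parent_pk].append(
            {"pk": child["pk"], "model": child["model"], "fields": fields}
        )

    nested = []
    for parent in parents:
        record = dict(parent)
        for child in by_parent.get(parent["pk"], []):
            record.setdefault(child["model"], []).append(child)
        nested.append(record)
    return nested


polls = [{"pk": 1, "model": "testapp.poll2",
          "fields": {"question": "What's Up?"}}]
choices = [
    {"pk": 1, "model": "testapp.choice2",
     "fields": {"poll": 1, "choice": "Django", "votes": 1}},
    {"pk": 2, "model": "testapp.choice2",
     "fields": {"poll": 1, "choice": "Python", "votes": 2}},
]
nested_polls = nest_reverse_relations(polls, choices, "poll")
print(nested_polls[0]["testapp.choice2"])
```

Each poll is written once and its choices no longer repeat the poll
data, which is the redundancy reduction this phase aims for.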

  This format becomes extremely useful when we are exporting data for
external processing. As far as deserializers are concerned, in this case
they check whether the serialized data contains any other app.modelname
key outside the fields dictionary. If such keys exist, they are treated
as reverse_relation data, and both the Poll2 Model objects and the
Choice2 Model objects are constructed. Calling save() should recursively
save all the instances. This implementation may not be as easy as it
looks. It requires a lot of design decisions to be taken before
implementing these changes.
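
The deserializer check described above can be sketched as follows. The
function split_reverse_relations and its fk_fields mapping (child model
label to the name of the FK field pointing back at the parent) are
assumptions for this example; Django has no such function.

```python
RESERVED_KEYS = ("pk", "model", "fields")


def split_reverse_relations(record, fk_fields):
    """Split a nested record into its own data and flat child records."""
    parent = {key: record[key] for key in RESERVED_KEYS if key in record}
    children = []
    for key, value in record.items():
        # Any other "app.modelname" key is treated as reverse relation
        # data; keys without a dot are left alone.
        if key in RESERVED_KEYS or "." not in key:
            continue
        for child in value:
            fields = dict(child["fields"])
            # Restore the FK that the nesting made implicit.
            fields[fk_fields[key]] = record["pk"]
            children.append(
                {"pk": child["pk"], "model": child["model"], "fields": fields}
            )
    return parent, children


nested = {
    "pk": 1,
    "model": "testapp.poll2",
    "fields": {"question": "What's Up?"},
    "testapp.choice2": [
        {"pk": 2, "model": "testapp.choice2",
         "fields": {"votes": 2, "choice": "Python"}},
    ],
}
poll, poll_choices = split_reverse_relations(nested, {"testapp.choice2": "poll"})
print(poll)
print(poll_choices)
```

Saving the parent before its children keeps the restored foreign keys
valid, which matches the recursive save() described above.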

  Implementing these features requires changes to the
serializers.base.Serializer.serialize method to handle the newly added
parameters. Reverse lookups will be added here. In-depth serialization
of relations will also be taken care of, possibly by adding new methods
to the Base class to return the required data. These methods recursively
return the data of multi-level relations, possibly by "yield"ing.
DeserializedObject.reversed_objects is added to contain a list of
reverse relation instances. The <format>.Deserializers will also
construct the Model objects by taking into account only the current
model's fields and not the related model's fields; from such related
model data they use just the PK field.

  The loaddata and dumpdata commands will optionally use
reverse_relations when given the --natural option. This helps to dump
the data with the least redundancy for exporting.

~~~~~~~~~~~~~~~~~
Benefits to Django
~~~~~~~~~~~~~~~~~
  By the end of this project, Django will have better support for
Serialization. It will support the much requested feature of in-depth
Serialization, thereby fixing ticket #4656. It also fixes #10295.
Fixtures and Serialized data become more convenient for use in Django
and externally by reducing Data Redundancy. Finally, there is better API
support for all the newly introduced features. The serialized data is
made more generic, keeping in mind possible future additions like
multiple PK support, as well as backwards compatibility.

~~~~~~~~~~~~
Deliverables
~~~~~~~~~~~~
  1. Internal implementation and code for in-depth serialization,
    reverse relation serialization and additional fields.
  2. Additional APIs to support in-depth serialization, to specify
    relation depth for serialization, support for the PK field name in
    the Serialized output and version id.
  3. APIs for reverse relation serialization.
  4. Additional options to loaddata and dumpdata commands.
  5. Test Cases for all the newly introduced features.

  Non-Code deliverables include testing performed at 3 different phases
to verify correctness and backwards compatibility, and detailed user and
development documentation for using the new Serializer implementations.

~~~~~~~~
When?
~~~~~~~~

  The project is planned to be completed in 9 phases. Every phase
includes documenting the progress during that phase. The timeline for
each of these phases is given below:
  1. Design Decisions and Initial preparation (Community Bonding Period:
      Already started - May 22nd)
        Working closely with the Django community to learn more about
        Django in depth: learning the code structure of Django, reading
        documentation related to Django internals, reading and
        understanding the code base of the ORM and Serializers in depth,
        and reading about other systems' Serializers. Communicating and
        discussing with the community about the outstanding issues to
        resolve the accepted tickets. The design decisions I propose are
        discussed and finalized.

  2. Finalizing Design and Coding Phase I (May 22nd – May 31st)
        Discussions with the Django community in general and my mentor
        to finalize the design decisions for the major portion of the
        project. Documenting the design decisions. Implementing
        Version-id and PK changes in Serializers, and implementing
        deserializers to parse the same. Serializers and deserializers
        will be split to handle both versions (old and new).

  3. Testing Phase I (June 1st – June 5th)
        Writing new test cases and adjusting the existing test cases to
        make sure the phase I changes don't break Django in any way.

  4. Coding Phase II (June 6th – June 21st)
        In-depth serialization of relations will be implemented in this
        phase, and the corresponding APIs will be added as described in
        the How? section. Changes and additions will be made to both
        serializers and deserializers for this. Corresponding changes
        are also made for fixtures.

  5. Testing Phase II (June 22nd – June 29th)
        New test cases will be added to ensure Django is still fully
        backwards compatible and the new features pass the tests too.

  6. Coding Phase III (June 30th – July 18th)
        Reverse relation serialization will be added. Relevant APIs will
        be implemented. Additions to DeserializedObject and save will be
        made to contain and save reversed_objects. These will be
        implemented for fixtures too. Mid Term evaluations happen during
        this phase.

  7. Testing Phase III (July 19th – July 26th)
        New test cases will be added for testing reverse relation
        serialization and backwards compatibility.

  8. Requesting community-wide reviews, testing and evaluation
    (July 27th – August 2nd)
        Final phase of testing of the overall project, obtaining and
        consolidating the results and evaluating them. Requesting the
        community to help me in final testing.

  9. Scrubbing Code, Wrap-Up, Documentation (August 3rd – August 10th)
        Fixing major and minor bugs, if any, and merging the project
        into the Django SVN Trunk. Writing User and Developer
        documentation and finalization.

~~~~~~~~~~~~~~
Where?
~~~~~~~~~~~~~~

   I am already comfortable with the django-devel mailing list and the
IRC channel #django-...@freenode.net. I will be able to contact my
mentor in both of these ways and will also be available through
google-talk (jabber). I am also comfortable with svn, git and mercurial,
since I was the SVN administrator for 2 academic projects and the git
administrator for 1 project.

~~~~~~~~~~
Why Me?
~~~~~~~~~~

  I am a 4th Year undergraduate student pursuing Information Science and
Engineering as a major at BMSCE, Bangalore, India (IST). I have been
using and advocating Free and Open Source Software for the past 5 years.
I have been one of the main coordinators of BMSLUG. I have given various
talks and conducted workshops on FOSS tools:
- Most importantly, I recently conducted a Python and *Django* workshop
  for beginners at NIT, Calicut, a premier Institution.
- How to contribute to FOSS? - A hands-on hackathon using GNUSim8085 as
  an example.
  http://groups.google.com/group/bms-lug/browse_thread/thread/0c9ca2367966727a
- I have been actively participating in various FOSS Communities by
  reporting bugs to communities like Ubuntu, GNOME, RTEMS, KDE.
- I was a major contributor to and writer of KDE's first-ever Handbook.
  http://img518.imageshack.us/img518/9796/hb1o.png
  http://img518.imageshack.us/img518/4296/hb2.png

I have been contributing patches and code to various FOSS communities,
the major ones being:
- GNUSim8085 (http://is.gd/p5wZ , http://is.gd/p5xK)
- KDE Step (http://is.gd/oci7)
- RTEMS
- Melange (The GSoC Web App.
http://code.google.com/p/soc/source/browse/trunk/AUTHORS)

My Django Work:
I was interested in contributing to Django even before GSoC occurred to
me. I discussed Ticket #373 with David Cramer on #django-dev. I read the
Django ORM code required for that, but could not write any code myself
because of University coursework. I have had some discussions about
fixing ticket #8161 on the django-devel list (http://is.gd/obr2), but
it was fixed before I could contribute. So I am applying for GSoC, as I
feel it lowers the barrier to getting started.
http://groups.google.com/group/django-developers/browse_thread/thread/5461dae3cf8d5d6a

   I have a fair understanding of Python concepts and one and a half
years of Python experience. I have a fair understanding of the Django
ORM code because of my previous work. I am getting used to the
Serialization code as I write this proposal and have no problems with
it. I have also been using Django for 1 year in some of my webapps.

   Since I have been working with FOSS communities, I have a good
understanding of FOSS development methodologies: communicating with
people, using Django's ticket tracker, coding and testing.

   Lastly, I want to express my deep commitment to this project and to
Django. I'm fully available this summer without any other commitments,
will tune my day/night rhythm to my mentor's requirements, and assure
dedicated work of 35-40 hours/week. I also assure you that I will
continue my commitment to Django well after GSoC. If any part of this
proposal is not clear, please contact me.

~~~~~~~~~~~~~~~~~~~~~~~~
Important Links and URLs
~~~~~~~~~~~~~~~~~~~~~~~~
  My Blog: http://madhusudancs.info
  My CV :
http://www.madhusudancs.info/sites/default/files/madhusudancsCV.pdf


-- 
Thanks and regards,
 Madhusudan.C.S

Blogs at: www.madhusudancs.info
Official Email ID: madhusu...@madhusudancs.info

