On May 18, Bram Duvigneau <[EMAIL PROTECTED]> wrote:

> I would like to get the new diffs every week and update my SQL
> database accordingly. As far as I see, this functionality is not
> yet implemented. Any thoughts on how I could do this?

For the 'sql' data access system, you can't.
It may be possible (but I've never tried it) with the 'local'
data access system, and specifically with the proprietary (but
free of charge) moviedb program, which I _think_ supports updates
from diff files.
The moviedb program is distributed alongside the plain text data files,
and must be used to generate the files accessed by 'local'.

Modifying the imdbpy2sql.py script to support the diff files is
an idea I'd love to implement, but I fear it's beyond my skills
as a developer.

Anyway, here I describe a possible way to proceed.

Currently imdbpy2sql.py parses the files (more or less) line by line,
storing the processed data (ready to be sent to the db) in dictionaries
and lists; from time to time these data sets are flushed to the database
using the executemany method of a cursor object. This is done for
performance reasons.
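
A minimal sketch of this buffer-and-flush pattern, assuming the standard
sqlite3 module; this is not the actual imdbpy2sql.py code, and the
BufferedInserter class, the "title" table and the sample titles (apart
from the one quoted later in this message) are hypothetical:

```python
import sqlite3


class BufferedInserter:
    """Collect rows in memory and send them to the db in batches."""

    def __init__(self, connection, statement, flush_every=1000):
        self.connection = connection
        self.statement = statement
        self.flush_every = flush_every
        self.rows = []

    def add(self, row):
        # Store the processed row; flush once the buffer is full.
        self.rows.append(row)
        if len(self.rows) >= self.flush_every:
            self.flush()

    def flush(self):
        # One executemany call instead of one execute call per row.
        if self.rows:
            cursor = self.connection.cursor()
            cursor.executemany(self.statement, self.rows)
            self.connection.commit()
            self.rows = []


# Hypothetical schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title (id INTEGER PRIMARY KEY, name TEXT)")

inserter = BufferedInserter(
    conn, "INSERT INTO title (name) VALUES (?)", flush_every=2
)
for name in ["Urban Thunder (1999)", "Sample Movie (2001)", "Other Movie (2003)"]:
    inserter.add((name,))
inserter.flush()  # send whatever is left in the buffer

row_count = conn.execute("SELECT COUNT(*) FROM title").fetchone()[0]
```

The final flush() call matters: without it, the rows still sitting in
the buffer when the input ends would never reach the database.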

With the diff files you can't proceed this way: for many files,
imdbpy2sql.py needs to know the context of the line it is
processing. Think about the "writers.list" file: when inserting movie
titles, we have to know which person they belong to.
In the diff files the context is lost, and if you read two lines like:
  53927a54450
  >                       Urban Thunder (1999)

How can you know about the person who wrote this movie?
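
To make the problem concrete, here is a small sketch that decodes a
command line like the one above. It assumes the diff files are in the
classic "normal" diff format, as the example suggests; the function name
and the returned dictionary layout are my own invention:

```python
import re

# Matches normal-diff commands such as "53927a54450" or "10,12c10,11".
HUNK_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_hunk_header(line):
    """Return the operation and line ranges, or None for other lines."""
    match = HUNK_RE.match(line.strip())
    if match is None:
        return None
    old_start = int(match.group(1))
    old_end = int(match.group(2) or old_start)
    new_start = int(match.group(4))
    new_end = int(match.group(5) or new_start)
    return {
        "op": match.group(3),  # 'a' = add, 'c' = change, 'd' = delete
        "old": (old_start, old_end),
        "new": (new_start, new_end),
    }


hunk = parse_hunk_header("53927a54450")
content = parse_hunk_header(">                       Urban Thunder (1999)")
```

The header only tells us that something was added after line 53927 of
the old file; the name of the writer is nowhere in the diff, so it has to
be recovered from the original file.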

So, after checking that the patch is not broken, you have to keep the
original file and the patch separate (without applying it) and
process the files in "blocks of information".
E.g. in the writers.list file, treat all the movies written by a given
person as a single block of information; then you can scan the
database to get the stored references (I think you should, at
least, retrieve the personID).  Once that is done, you can read the
diff file and identify which lines refer to the block of information
currently being considered.  If there are differences, probably the
safest way to act is to completely remove the old block of information
from the database, then apply the patch (in memory) and insert the new
block of information into the database.
So for every block of information there should be a generic structure
that can be used both to insert the data into the database and to
delete it.
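
A rough sketch of the "block of information" idea. It assumes a
simplified writers.list layout (the person's name, a tab, then a title;
further titles on lines starting with whitespace; blocks separated by
blank lines), which may not match the real files exactly. The Block
class, its methods and the sample names are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Every title credited to one person, plus its position in the file."""
    person: str
    first_line: int  # 1-based line number in the original file
    titles: list = field(default_factory=list)

    @property
    def last_line(self):
        return self.first_line + len(self.titles) - 1

    def rows(self, person_id):
        # The same rows can feed both an INSERT and a DELETE statement.
        return [(person_id, title) for title in self.titles]


def split_blocks(lines):
    """Group the lines of the original file into per-person blocks."""
    blocks = []
    current = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            current = None  # a blank line closes the current block
        elif line[0].isspace():
            if current is not None:
                current.titles.append(line.strip())
        else:
            person, _, title = line.partition("\t")
            current = Block(person.strip(), number, [title.strip()])
            blocks.append(current)
    return blocks


def block_for_line(blocks, line_number):
    """Find the block that contains a line number taken from the diff."""
    for block in blocks:
        if block.first_line <= line_number <= block.last_line:
            return block
    return None


# Made-up sample data.
lines = [
    "Doe, John\tFirst Movie (1997)",
    "\t\t\tSecond Movie (1998)",
    "",
    "Roe, Jane\tOther Movie (2000)",
]
blocks = split_blocks(lines)
# A diff command like "2a3" adds a line after old line 2, so the new
# title belongs to the block that contains line 2.
owner = block_for_line(blocks, 2)
```

An added line that itself starts a new person would open a new block
instead; that case, like the other corner cases, is not handled here.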

Now... this procedure comes with some caveats:
- performance: the current implementation is - after all - fast enough
  if you consider how many transformations are applied to the data and
  how much information we're managing.
  This new implementation will probably need to resort to a number of
  dirty tricks to achieve decent performance.
- there are a lot of corner cases to deal with: think about renamed movies
  (to fix a spelling, the production year for future releases and so on)
  or person names (merging two persons, if they are the same, or adding
  a (I) to a name when a homonym is added to the database).
- we need to manage cases where the patches are wrong or the data is no
  longer consistent (quite easy: erase everything and restart from zero,
  but these cases must be recognized).
- manage - again, with decent performance - the first insert of the data
  (an empty database and no patches to apply): it's probably better if
  this case is handled by the same script that handles the patches (having
  to support _two_ long and complex scripts with very similar purposes
  seems way too difficult).
- stay db-independent, using SQLObject (for a number of reasons,
  switching to another ORM is not very high on my list of priorities).


Having said that, I'm open to any suggestion and I can consider
writing some prototypes, if someone is willing to
help in designing and coding such a frightful beast. :-)

So, anonymous developer out there: if you're brave enough to help
with this near-impossible task, stand up and shout out all your pride
as a Python developer! ;-)


-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel