Hi all,
this is My Evil Master Plan(tm) about the new format for movie
titles, adopted by IMDb on both the web and the plain text data files.
If you haven't noticed, now everything is "The Title" and the
old "Title, The" is gone.

In short:

[general]
- start using the new format internally.  Right now movie.data['title']
  is stored as "Title, The" and converted when a user accesses
  movie['title'].  In the future it will be movie['canonical title']
  and friends that require a conversion, instead.
- reminder to myself: the functions involved in "Exact Title Searches"
  must be checked.
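
To make the first point concrete, here is a minimal sketch of "store the
new format, convert only on access".  It is not the real Movie class:
MovieSketch, to_canonical and the English-only ARTICLES tuple are
made-up names and simplifications, used just for illustration.

```python
# Hypothetical sketch, not IMDbPY's real Movie class.
ARTICLES = ('the', 'a', 'an')  # assumption: English articles only


def to_canonical(title):
    """'The Title' -> 'Title, The'; other titles are returned unchanged."""
    first, sep, rest = title.partition(' ')
    if sep and rest and first.lower() in ARTICLES:
        return '%s, %s' % (rest, first)
    return title


class MovieSketch(object):
    def __init__(self, title):
        # The new format ("The Title") is what gets stored internally.
        self.data = {'title': title}

    def __getitem__(self, key):
        if key == 'canonical title':
            # The conversion happens only when the old format is requested.
            return to_canonical(self.data['title'])
        return self.data[key]


movie = MovieSketch('The Matrix')
print(movie['title'])            # The Matrix
print(movie['canonical title'])  # Matrix, The
```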

[http/mobile]
- a lot (all?) of the data from the web was already in the new format,
  which is more readable.
  Currently it's converted to the old format before the creation of a
  Movie instance; this must change (more on this later).

[sql]
- the title in the database will use the new format too; this means
  we should be aware of this when the data is retrieved.
- the main problem here is that we also need to handle users'
  searches.
  At insert-time, we need to check that the title variations (used
  to compute a set of soundex values) are correct.
  Matching changes will be needed at retrieve time.
  The unfunny part is that the cutils.c module will require
  fixes, too.  I already have a royal headache. :-/
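
As a rough illustration of the "title variations" idea, assuming the
title is stored in the new format: the sketch below derives the other
spellings a search may have to match.  Which variations the 'sql' code
really computes is not spelled out here, so title_variations and the
English-only ARTICLES tuple are hypothetical; the soundex step is left out.

```python
# Hypothetical sketch: the real set of variations may differ.
ARTICLES = ('the', 'a', 'an')  # assumption: English articles only


def title_variations(title):
    """Return the spellings of a new-format title that a search might use."""
    variations = {title}
    first, sep, rest = title.partition(' ')
    if sep and rest and first.lower() in ARTICLES:
        variations.add('%s, %s' % (rest, first))  # old "Title, The" format
        variations.add(rest)                      # title without the article
    return variations


print(sorted(title_variations('The Matrix')))
```

Each variation would then be fed to the soundex computation, both at
insert time and when handling a user's search.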

[local]
- this is probably the last nail in the coffin of 'local'.  Not a big
  deal, and it will probably stay around for the next release (and
  be removed later: remember that some portions of the code are
  shared between 'local' and 'sql').


The keys to everything are the imdb.utils.analyze_title and
imdb.utils.build_title functions.
analyze_title (from a string to a dictionary) takes the 'canonical'
argument, which defaults (more or less) to False; when True it _first_
converts the string to the old format.  This behavior is used a lot
in 'http'.
build_title (from a dict to a string) has a 'canonical' argument
too: when True (default False) the old format is returned.
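
A simplified sketch of how the 'canonical' flag behaves in the two
directions.  This is not the code in imdb.utils: the *_sketch functions,
to_old/to_new, the English-only ARTICLES tuple and the "Title (year)"
regular expression are assumptions made for the example, and the real
functions handle many more cases.

```python
import re

# Hypothetical sketch, not the real imdb.utils functions.
ARTICLES = ('the', 'a', 'an')  # assumption: English articles only
_YEAR_RE = re.compile(r'^(?P<title>.*?)\s*\((?P<year>\d{4})\)$')


def to_old(title):
    """'The Title' -> 'Title, The'."""
    first, sep, rest = title.partition(' ')
    if sep and rest and first.lower() in ARTICLES:
        return '%s, %s' % (rest, first)
    return title


def to_new(title):
    """'Title, The' -> 'The Title'."""
    head, sep, tail = title.rpartition(', ')
    if sep and head and tail.lower() in ARTICLES:
        return '%s %s' % (tail, head)
    return title


def analyze_title_sketch(title, canonical=False):
    """From a string to a dictionary; canonical=True yields the old format."""
    title = title.strip()
    match = _YEAR_RE.match(title)
    if match:
        info = {'title': match.group('title'), 'year': int(match.group('year'))}
    else:
        info = {'title': title}
    if canonical:
        # Simplification: only the title part is converted.
        info['title'] = to_old(info['title'])
    return info


def build_title_sketch(info, canonical=False):
    """From a dictionary to a string; canonical=True returns the old format."""
    title = to_old(info['title']) if canonical else to_new(info['title'])
    if 'year' in info:
        title += ' (%d)' % info['year']
    return title


print(analyze_title_sketch('The Matrix (1999)', canonical=True))
print(build_title_sketch({'title': 'Matrix, The', 'year': 1999}))
```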

For a moment I had the itch to invert the logic of the 'canonical'
argument (from "should I convert it to" to "is the input in"),
but this is a change at API level, and... [1]


[test-suite]
I've introduced a new test to check that movie['title']
is in the new format.
Currently it passes, but when we remove the current transformation
between the internal (movie.data['title']) format and the 'The Title'
one, it will fail spectacularly. :-)
It can be run by itself, with:
  python ./test_parser.py -t -M -H -X 2>&1 | less


As usual, I change my mind about 6 times a day on every subject,
so nothing is set in stone. :-)


+++
[1] not a major change (the returned type won't change), but...

-- 
Davide Alberani <davide.alber...@gmail.com> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
