I took a deeper look at soundex codes to match movie titles.

My conclusions:
1. A variable length of up to 5 characters (an upper-case letter and
   at most 4 digits) is enough.
   With the current list of titles (~700.000 titles, excluding episode
   titles) we get about 22.000 different soundex codes, each one
   - on average - referring to 30 titles.
   The worst case is titles without a soundex code (only vowels,
   digits, non-ascii chars...), which match ~2.500 titles (the second
   worst case matches ~1.100 titles).
   These are good numbers: even in the worst case we can pass the
   list of titles to the ratcliff-obershelp function to sort the
   results.  It will take just a fraction of a second, even in this
   case.
   I'm still not sure that the numbers "at the other end" are
   as good: there are about 6.000 soundex codes which match
   5 or fewer titles; I think that only practical use will tell us
   if the search for some titles is too difficult (i.e.: it returns
   too few/many good results).

2. Episode titles can be more problematic: there are a lot of titles,
   about 90.000, without a "real" title but with (#season.episode)
   or (2005-06-12) notations.
   We know whether the user is searching for a normal title or the
   title of an episode, so we can keep these two cases separate.

   Metacode to search for an episode:
     title_dict = analyze_title('"The Series" {The Episode}')
     episode_title = title_dict['title']
     series_title = title_dict['episode of']['title']
     matching_series = SQLQuery( [select IDs of non-episode with soundex
                                  matching soundex(series_title)] )
     matching_episodes = SQLQuery( [episodes with episodeOfID in the
                                    matching_series list and with soundex
                                    matching soundex(episode_title)] )
     result_list = sort matching_episodes using ratcliff-obershelp.

   I really don't know whether the performance will be acceptable.

3. There is no need to add more than one "soundexCode" column in
   the database.  All we need is to calculate the soundex code of
   the title, taking care to _remove_ the ending article (", The",
   ", A", ...)
   Calculating the soundex code of other variations doesn't seem
   to add any value to the search for a title.
   At "search time", the title provided by the user will be stripped
   of the article, as well.

4. The soundex() function can return None (NULL in the db) instead
   of "0", if no code is calculable.

5. I'm not sure, but the "name" table is probably different, and
   more than one "soundexCode" column may be useful there.
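
To make points 1, 3 and 4 concrete, here is a minimal sketch and not
the code that will be committed.  The names strip_article, _CODES and
max_len are hypothetical, the article list is only an example, and the
handling of H and W is simplified (they are treated like vowels).  It
assumes that a title with no codable consonant has no code at all.

```python
import re

# Classic Soundex consonant groups; vowels, H, W and Y carry no digit.
_CODES = {}
for digit, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in group:
        _CODES[letter] = str(digit)

# Trailing articles in the ", The" form; this short list is an assumption.
_ARTICLE_RE = re.compile(r",\s*(the|a|an)$", re.IGNORECASE)


def strip_article(title):
    """Remove an ending article such as ', The' or ', A'."""
    return _ARTICLE_RE.sub("", title.strip())


def soundex(title, max_len=5):
    """Return a letter plus at most max_len - 1 digits, or None."""
    letters = [c for c in strip_article(title).upper() if "A" <= c <= "Z"]
    if not any(c in _CODES for c in letters):
        return None  # only vowels, digits, non-ascii chars...
    code = letters[0]
    prev = _CODES.get(letters[0])
    for c in letters[1:]:
        if len(code) >= max_len:
            break
        digit = _CODES.get(c)
        # Skip letters without a digit and repeats of the previous digit.
        if digit and digit != prev:
            code += digit
        prev = digit
    return code
```

With this sketch soundex("Matrix, The") and soundex("Matriks, The")
give the same code, while soundex("2001") is None, which would be
stored as NULL in the "soundexCode" column.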

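The episode search of point 2 could look roughly like the following
sketch, which uses sqlite3 only to stay self-contained.  The "titles"
table layout, the search_episode name and the hand-written soundex
codes in the sample rows are all assumptions made for illustration;
only the "soundexCode" and "episodeOfID" columns come from the notes
above.  difflib.SequenceMatcher stands in for the ratcliff-obershelp
function, since its ratio is based on a similar matching algorithm.

```python
import sqlite3
from difflib import SequenceMatcher


def search_episode(conn, series_code, episode_code, episode_title):
    """Find series by soundex, then their episodes by soundex, then sort."""
    series_ids = [row[0] for row in conn.execute(
        "SELECT id FROM titles "
        "WHERE episodeOfID IS NULL AND soundexCode = ?", (series_code,))]
    if not series_ids:
        return []
    placeholders = ",".join("?" * len(series_ids))
    rows = conn.execute(
        "SELECT id, title FROM titles WHERE soundexCode = ? "
        "AND episodeOfID IN (%s)" % placeholders,
        [episode_code] + series_ids).fetchall()

    def score(row):
        # Similarity between the searched title and the candidate title.
        return SequenceMatcher(None, episode_title.lower(),
                               row[1].lower()).ratio()

    return sorted(rows, key=score, reverse=True)


# Tiny in-memory example; the codes below are placeholders, not real output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (id INTEGER PRIMARY KEY, title TEXT, "
             "soundexCode TEXT, episodeOfID INTEGER)")
conn.executemany("INSERT INTO titles VALUES (?, ?, ?, ?)", [
    (1, "The Series", "S62", None),
    (2, "The Episode", "E123", 1),
    (3, "The Episodes", "E123", 1),
    (4, "The Episode", "E123", 99),   # episode of a series that does not match
])
results = search_episode(conn, "S62", "E123", "The Episode")
```

Whether two queries plus the in-memory sort are fast enough on the
real data is exactly the open question of point 2; the sketch only
shows the shape of the lookup.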

Within a day, I'll commit the needed functions to CVS.

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/


_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel