I took a deeper look at soundex codes to match movie titles.
My conclusions:
1. A variable-length code of up to 5 characters (an upper-case letter
followed by at most 4 digits) is enough.
With the current list of titles (~700,000 titles, excluding episode
titles) we get about 22,000 different soundex codes, each one
referring, on average, to about 30 titles.
The worst case is titles without a soundex code (only vowels,
digits, non-ASCII chars...), which covers ~2,500 titles (the second
worst case matches ~1,100 titles).
These are good numbers: even in the worst case we can pass the
list of titles to the ratcliff-obershelp function to sort the
results. It will take just a fraction of a second, even in this
case.
I'm still not sure that the numbers "at the other end" are
as good: there are about 6,000 soundex codes which match
5 or fewer titles; I think that only practical use will tell us
if the search for some titles is too difficult (i.e., it returns
too few or too many good results).
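
To make the sorting step concrete, here is a minimal sketch. It assumes
Python's difflib.SequenceMatcher as a stand-in for the ratcliff-obershelp
function (difflib implements a closely related "gestalt" matching
algorithm); the rank_titles name and the sample bucket are hypothetical.

```python
from difflib import SequenceMatcher


def rank_titles(query, candidates, limit=10):
    """Sort candidate titles by similarity to the query, best first."""
    query_lc = query.lower()
    scored = []
    for title in candidates:
        # ratio() is 2*M/T: M matched characters, T total length of both strings.
        score = SequenceMatcher(None, query_lc, title.lower()).ratio()
        scored.append((score, title))
    # Highest score first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


# Hypothetical bucket: titles returned by one soundex lookup.
bucket = ["Matrix, The", "Matrix Reloaded, The", "Motorcycle Diaries, The", "Matrix"]
for score, title in rank_titles("matrix", bucket, limit=3):
    print(f"{score:.2f}  {title}")
```

Since each comparison only touches one bucket of titles, the cost of this
step grows with the bucket size, not with the size of the whole database.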
2. Episode titles can be more problematic: there are a lot of them,
about 90,000, without a "real" title but with a (#season.episode)
or (2005-06-12) notation.
We know whether the user is searching for a normal title or for the
title of an episode, so we can keep these two cases separate.
Metacode to search for an episode:
  title_dict = analyze_title('"The Series" {The Episode}')
  episode_title = title_dict['title']
  series_title = title_dict['episode of']['title']
  matching_series = SQLQuery( [select IDs of non-episodes with soundex
                               matching soundex(series_title)] )
  matching_episodes = SQLQuery( [episodes with episodeOfID in the
                                 matching_series list and with soundex
                                 matching soundex(episode_title)] )
  result_list = sort matching_episodes using ratcliff-obershelp.
I really don't know whether the performance will be acceptable.
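
The metacode above could be tried out along these lines. This is only a
sketch against an in-memory SQLite database: the "titles" table, its
layout, the search_episode name and the sample rows are all assumptions
(only the soundexCode and episodeOfID names come from this mail), the
soundex codes are worked out by hand, and difflib.SequenceMatcher stands
in for the ratcliff-obershelp function.

```python
import sqlite3
from difflib import SequenceMatcher

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        soundexCode TEXT,       -- NULL when no code can be computed
        episodeOfID INTEGER     -- NULL for non-episode titles
    );
    CREATE INDEX idx_soundex ON titles (soundexCode);
    CREATE INDEX idx_episode_of ON titles (episodeOfID);
""")
# Hypothetical sample rows with hand-computed codes.
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?)",
    [
        (1, "Series", "S62", None),
        (2, "Serious", "S62", None),
        (3, "Episode", "E123", 1),
        (4, "Epilogue", "E142", 1),
        (5, "Episode", "E123", 2),
        (6, "Another Show", "A5362", None),
    ],
)


def search_episode(conn, series_code, episode_code, episode_title):
    """Find episodes by the soundex of both series and episode, best match first."""
    rows = conn.execute(
        """
        SELECT ep.id, ep.title
        FROM titles AS ep
        WHERE ep.soundexCode = ?
          AND ep.episodeOfID IN (
              -- non-episode titles whose code matches the series title
              SELECT s.id FROM titles AS s
              WHERE s.episodeOfID IS NULL AND s.soundexCode = ?
          )
        ORDER BY ep.id
        """,
        (episode_code, series_code),
    ).fetchall()

    def similarity(row):
        return SequenceMatcher(None, episode_title.lower(), row[1].lower()).ratio()

    # Final step of the metacode: sort the survivors by string similarity.
    return sorted(rows, key=similarity, reverse=True)


print(search_episode(conn, "S62", "E123", "Episode"))
```

The two SQL steps are folded into one statement with a subquery; whether
that is faster than two separate queries on the real data is something
only a test on the full database can tell.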
3. There is no need to add more than one "soundexCode" column to
the database. All we need is to calculate the soundex code of
the title, taking care to _remove_ the trailing article (", The",
", A", ...).
Calculating the soundex code of other variations doesn't seem
to add any value to the search for a title.
At "search time", the article will be stripped from the title
provided by the user as well.
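
A possible shape for the article stripping, as a sketch only: the
strip_article name and the short English-only ARTICLES list are
hypothetical, and handling a leading article ("The Matrix") is my
assumption about what a user is likely to type.

```python
# Hypothetical, deliberately short list; a real one would cover more languages.
ARTICLES = ("the", "a", "an")


def strip_article(title):
    """Drop a trailing (", The") or leading ("The ") article from a title."""
    stripped = title.strip()
    lowered = stripped.lower()
    for article in ARTICLES:
        # Canonical form, as stored in the database: "Matrix, The"
        if lowered.endswith(", " + article):
            return stripped[: -len(article) - 2].rstrip()
        # Form a user is likely to type: "The Matrix"
        if lowered.startswith(article + " "):
            return stripped[len(article) + 1:].lstrip()
    return stripped


print(strip_article("Matrix, The"))   # Matrix
print(strip_article("The Matrix"))    # Matrix
```

Applying the same function both when filling the database and at search
time means the stored title and the user's query get the same code
whichever form they arrive in.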
4. The soundex() function can return None (NULL in the db) instead
of "0" if no code can be computed.
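
For illustration, a soundex() along these lines would behave as described
in points 1 and 4. It is a sketch, not the code that will be committed: it
uses the standard Soundex consonant groups, keeps at most 5 characters
without zero padding, and returns None when the text has no ASCII letter
that maps to a digit (my reading of "only vowels, digits, non-ascii
chars").

```python
import string

# Standard Soundex consonant groups; vowels, H, W and Y get no digit.
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_DIGITS = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(text, max_len=5):
    """Return a variable-length soundex code, or None if none can be computed."""
    # Keep ASCII letters only; digits, punctuation and non-ASCII chars are dropped.
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    # Assumption: no coded consonant at all means "no code" (None, NULL in the db).
    if not any(ch in _DIGITS for ch in letters):
        return None
    code = letters[0]
    previous = _DIGITS.get(letters[0])
    for ch in letters[1:]:
        digit = _DIGITS.get(ch)
        if digit is None:
            # H and W are transparent; a vowel (or Y) breaks a run of equal digits.
            if ch not in "HW":
                previous = None
            continue
        if digit != previous:
            code += digit
            if len(code) == max_len:
                break
        previous = digit
    return code


print(soundex("Matrix"))  # M362
print(soundex("2001"))    # None
```

Returning None instead of "0" keeps the titles without a code out of any
equality match on the soundexCode column, since NULL never compares equal
in SQL.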
5. I'm not sure, but the "name" table is probably different, and
more than one "soundexCode" column may be useful there.
Within a day, I'll commit the needed functions to CVS.
--
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
_______________________________________________
Imdbpy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel