I took a deeper look at soundex codes for matching movie titles.
My conclusions:

1. A variable length of up to 5 characters (an upper-case letter and at
   most 4 digits) is enough. With the current list of titles (~700,000
   titles, excluding episode titles) we get about 22,000 different
   soundex codes, each one referring, on average, to about 30 titles.
   The worst case is the set of titles without a soundex code (only
   vowels, digits, non-ASCII chars...), which matches ~2,500 titles
   (the second worst case matches ~1,100 titles).

   These are good numbers: even in the worst case we can pass the list
   of titles to the ratcliff-obershelp function to sort the results.
   It will take just a fraction of a second, even in this case.

   I'm still not sure that the numbers "at the other end" are as good:
   there are about 6,000 soundex codes that match 5 or fewer titles;
   I think only practical use will tell us whether the search for some
   titles is too difficult (i.e., it returns too few or too many good
   results).
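To make point 1 more concrete, here is a minimal sketch of the idea, not the code that will be committed. It assumes a classic Soundex letter mapping, treats a title with no digit-bearing consonant as having no code (following the "only vowels, digits, non-ASCII chars" description above), and uses difflib.SequenceMatcher, whose ratio is similar in spirit to Ratcliff-Obershelp, as a stand-in for the ratcliff-obershelp function. The names build_index, rank_titles and search are hypothetical.

```python
import difflib
import string

# Classic Soundex digit groups; vowels, h, w and y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text, max_len=5):
    """Return an upper-case letter plus up to max_len - 1 digits,
    or None when no code can be computed (assumption of this sketch)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    # No ASCII consonant that maps to a digit: no code at all.
    if not any(c in _CODES for c in letters):
        return None
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
            if len(code) == max_len:
                break
        # h and w do not separate two consonants with the same digit.
        if c not in "hw":
            prev = digit
    return code


def build_index(titles):
    """Group titles by soundex code; titles without a code share the None key."""
    index = {}
    for title in titles:
        index.setdefault(soundex(title), []).append(title)
    return index


def rank_titles(query, candidates):
    """Sort candidates by similarity to the query, best match first."""
    def score(title):
        return difflib.SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return sorted(candidates, key=score, reverse=True)


def search(query, index):
    """Fetch the titles sharing the query's code, then sort them by similarity."""
    return rank_titles(query, index.get(soundex(query), []))


index = build_index(["The Matrix", "The Matrixx", "Heat", "2001"])
results = search("The Matrx", index)
```

The soundex code only narrows the candidates down to one bucket; the similarity sort then orders that bucket, which is why the size of the largest bucket (the one without a code) is the number that matters for speed.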
2. Episode titles can be more problematic: there are a lot of them,
   about 90,000, without a "real" title but with a (#season.episode)
   or (2005-06-12) notation. We know whether the user is searching for
   a normal title or for the title of an episode, so we can keep the
   two cases separate. Metacode to search for an episode:

    title_dict = analyze_title('"The Series" {The Episode}')
    episode_title = title_dict['title']
    series_title = title_dict['episode of']['title']
    matching_series = SQLQuery(
        [select IDs of non-episodes with soundex matching
         soundex(series_title)]
    )
    matching_episodes = SQLQuery(
        [episodes with episodeOfID in the matching_series list and
         with soundex matching soundex(episode_title)]
    )
    result_list = sort matching_episodes using ratcliff-obershelp

   I really don't know if the performance will be acceptable.

3. There is no need to add more than one "soundexCode" column to the
   database. All we need is to calculate the soundex code of the
   title, taking care to _remove_ the trailing article (", The",
   ", A", ...). Calculating the soundex code of other variations
   doesn't seem to add any value to the search for a title. At
   "search time", the title provided by the user will be stripped of
   its article as well.

4. The soundex() function can return None (NULL in the db) instead of
   "0" if no code can be calculated.

5. I'm not sure, but the "name" table is probably different, and there
   more than one "soundexCode" column may be useful.

Within a day, I'll commit the needed functions to CVS.

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel