On Apr 06, Martin Kirst <[EMAIL PROTECTED]> wrote: > Hello everybody!
Hi! > I think it is not a good idea to pre calculate all the combinations > of a movie title (person name and so on) which are possible. I fear I've not expressed myself clearly (Pardon! Even my Klingon is better than my English... ;-) Soundex, by definition, is a "phonetic coding"; it means that it takes _a single string_ and convert it to a representation of "how it sounds, when you speak it". Then, this representation (normally a fixed length short string, with letters and numbers) can be used to compare different strings; if two strings have an _identical_ representation, they "sound similar". E.g. - using the SOUNDEX algorithm of the MySQL engine: SOUNDEX('GINO PINO') -> G515 SOUNDEX('GINO FINO') -> G515 # they sound similar. As you can see the strings are not directly compared for similarity; they just happen to share the same soundex value. Caveat: what "sound similar" is quite arbitrary, and varies with the languages; normally the soundex algorithms are aimed to the English language. There are other algorithms (NYSIIS seems promising), but so long I'll stay with soundex - it's more simple and well known, and it seems to work good enough even with other languages. On the other side, a "similarity metric" is a different beast: it takes _two_ strings and tries to define "how much they are similar", normally returning a float number in the 0...1 range (0.0: they are totally unrelated; 1.0: they are the same string). In this family you'll find Hamming distance, Ratcliff-Obershelp, Jaro-Winkler and some more. They are (relatively) _heavy_ algorithms; requiring much more CPU power than phonetic coding algos. Worse than ever, you need to redo the match (and the math) on _every single entry in the database_ every time the user is searching for a title/name; they can't be precalculated, since the returned value depends on both the searched string and the strings in the db. What we want to do is this (remember that everything is born by the fact that not every db engine support SOUNDEX, like MySQL do): 1. in the db's table of titles (and names) add one (or two, or three - I think no more are needed) columns to store soundex for the title in the row. At insert time, calculate the soundex of the title (and - probably the soundex of some variations like "Title, The", "The Title" and "Title") This - hopefully - will not take a lot of CPU time, and just a few _megabytes_ for both movies and title in the worst case - ...if I've done the math right. 2. when a user searchs for a title/name, calculate the soundex of the searched title/name (and variation on the given title) and ask the db engine every row with matching soundex. This is just a comparison on short strings, and would not require a lot to the db engine. 3. on this _small_ subset of the database of titles/names that "sound like" the searched one, apply a similarity metric (Ratcliff-Obershelp) to sort the results by similarity - against the seached title. Note: soundex _can't_ tell you how much two strings are similar; it can only tell that they are pronounced (_very_ more or less) the same. > Because of, If we assume 10.000 movie titles (not really counted > yet) and assume 10 combinations for each title we'll have 100.000 > titles in our database. You're right: it would be totally unmanageable. Pre-calculating differences between titles in the database will go out of control for both memory and CPU even with few hundreds of titles, I fear. Moreover it won't be useful: we need to compare the titles in the database against what the _users_ give us, not against other titles in the db. Oh, by the way we have 950.000 title and 2.000.000 names in the database. And counting... :-) > And on my machine I already have sacrificed 1252 MegaBytes :-/ I know, I know. :-/ There are a lot of data, there is no escape [1]. PS: if you know that you don't need some data [2], you can just remove the ".list.gz" file from the directory, when you run the imdbpy2sql.py script; it will be ignored! > 0. don't pre calculate the soundex word, > do it just the normal way like today As explained, MySQL supports SOUNDEX, but it's not standard by any means. :-/ > I'm also open minded to to write a little benchmark, for checking > both approaches, which is the more promising one. I think there will be a lot to test, in the future; I fear that the switch to SQLObject and the support for tv series episodes will increase the time needed to run the imdbpy2sql.py script by a factor of 2 (or 3 or 4 or - True Phear(tm)! - 5 or 6). On the other side, I think that at _usage_ time, performances won't be greatly affected (maybe no more than 1.2x or 1.5x) > Greetings Many thanks for your hints! I hope to have explained our Evil Master Plan to take over the world and the IMDb's database. ;-) If you still have doubts/questions, just ask! +++ [1] not totally true: there are a lot (_a bloody damn lot_) of repetitions in almost every tables. Think about information like the ones contained in the color-info and genres ".list.gz" files: there are just a few values, but in the database for every movies the whole string ("color", "Comedy", ...) is stored. The imdbpy2sql.py script is not very well written, but it's short and relatively simple; once you've read the information from a file there are just three very simple ways to automatically put the data in the db (MoviesCache, PersonsCache and SQLData classes' instances). They are generic and simple; so far I can't see an easy way to manage these repetitions (manage every single case will be a nightmare not worthing the saved space), and so this is not a priority to me, right now. [2] maybe you're not interested in special effects companies, or goofies or distributors or... -- Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47] http://erlug.linux.it/~da/ ------------------------------------------------------- This SF.Net email is sponsored by xPML, a groundbreaking scripting language that extends applications into web and mobile media. Attend the live webcast and join the prime developer group breaking into this new coding territory! http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 _______________________________________________ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel