Re: [Imdbpy-devel] SOUNDEX and imdbpy

Davide Alberani Fri, 07 Apr 2006 08:32:03 -0700

On Apr 06, Martin Kirst <[EMAIL PROTECTED]> wrote:

> Hello everybody!


Hi!

> I think it is not a good idea to pre calculate all the combinations
> of a movie title (person name and so on) which are possible.

I fear I've not expressed myself clearly (Pardon!  Even my Klingon is
better than my English... ;-)

Soundex, by definition, is a "phonetic coding"; it means that it takes
_a single string_ and convert it to a representation of "how it sounds,
when you speak it".  Then, this representation (normally a fixed length
short string, with letters and numbers) can be used to compare different
strings; if two strings have an _identical_ representation, they "sound
similar".
E.g. - using the SOUNDEX algorithm of the MySQL engine:
  SOUNDEX('GINO PINO') -> G515
  SOUNDEX('GINO FINO') -> G515  # they sound similar.

As you can see the strings are not directly compared for similarity;
they just happen to share the same soundex value.

Caveat: what "sound similar" is quite arbitrary, and varies with the
languages; normally the soundex algorithms are aimed to the English
language.  There are other algorithms (NYSIIS seems promising), but
so long I'll stay with soundex - it's more simple and well known, and
it seems to work good enough even with other languages.

On the other side, a "similarity metric" is a different beast: it takes
_two_ strings and tries to define "how much they are similar", normally
returning a float number in the 0...1 range (0.0: they are totally
unrelated; 1.0: they are the same string).
In this family you'll find Hamming distance, Ratcliff-Obershelp,
Jaro-Winkler and some more.
They are (relatively) _heavy_ algorithms; requiring much more CPU
power than phonetic coding algos.
Worse than ever, you need to redo the match (and the math) on _every
single entry in the database_ every time the user is searching for
a title/name; they can't be precalculated, since the returned value
depends on both the searched string and the strings in the db.

What we want to do is this (remember that everything is born by the
fact that not every db engine support SOUNDEX, like MySQL do):

1. in the db's table of titles (and names) add one (or two, or three -
   I think no more are needed) columns to store soundex for the title
   in the row.
   At insert time, calculate the soundex of the title (and - probably
   the soundex of some variations like "Title, The", "The Title" and "Title")
   This - hopefully - will not take a lot of CPU time, and just a few
   _megabytes_ for both movies and title in the worst case - ...if I've
   done the math right.
2. when a user searchs for a title/name, calculate the soundex of the
   searched title/name (and variation on the given title) and ask the
   db engine every row with matching soundex.  This is just a comparison
   on short strings, and would not require a lot to the db engine.
3. on this _small_ subset of the database of titles/names that "sound
   like" the searched one, apply a similarity metric (Ratcliff-Obershelp)
   to sort the results by similarity - against the seached title.
   Note: soundex _can't_ tell you how much two strings are similar; it
   can only tell that they are pronounced (_very_ more or less) the same.

> Because of, If we assume 10.000 movie titles (not really counted
> yet) and assume 10 combinations for each title we'll have 100.000
> titles in our database.

You're right: it would be totally unmanageable.  Pre-calculating
differences between titles in the database will go out of control
for both memory and CPU even with few hundreds of titles, I fear.
Moreover it won't be useful: we need to compare the titles in the
database against what the _users_ give us, not against other titles
in the db.
Oh, by the way we have 950.000 title and 2.000.000 names in the database.
And counting... :-)

> And on my machine I already have sacrificed 1252 MegaBytes :-/

I know, I know. :-/
There are a lot of data, there is no escape [1].
PS: if you know that you don't need some data [2], you can just remove
the ".list.gz" file from the directory, when you run the imdbpy2sql.py
script; it will be ignored!

> 0. don't pre calculate the soundex word,
>    do it just the normal way like today

As explained, MySQL supports SOUNDEX, but it's not standard by
any means. :-/

> I'm also open minded to to write a little benchmark, for checking
> both approaches, which is the more promising one.

I think there will be a lot to test, in the future; I fear that
the switch to SQLObject and the support for tv series episodes
will increase the time needed to run the imdbpy2sql.py script
by a factor of 2 (or 3 or 4 or - True Phear(tm)! - 5 or 6).
On the other side, I think that at _usage_ time, performances
won't be greatly affected (maybe no more than 1.2x or 1.5x)

> Greetings

Many thanks for your hints!
I hope to have explained our Evil Master Plan to take over the world
and the IMDb's database. ;-)
If you still have doubts/questions, just ask!


+++
[1] not totally true: there are a lot (_a bloody damn lot_) of repetitions
    in almost every tables.  Think about information like the ones
    contained in the color-info and genres ".list.gz" files: there are
    just a few values, but in the database for every movies the whole
    string ("color", "Comedy", ...) is stored.
    The imdbpy2sql.py script is not very well written, but it's short
    and relatively simple; once you've read the information from a file
    there are just three very simple ways to automatically put the data
    in the db (MoviesCache, PersonsCache and SQLData classes' instances).
    They are generic and simple; so far I can't see an easy way to
    manage these repetitions (manage every single case will be a
    nightmare not worthing the saved space), and so this is not
    a priority to me, right now.
[2] maybe you're not interested in special effects companies, or
    goofies or distributors or...
-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
Imdbpy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel

Re: [Imdbpy-devel] SOUNDEX and imdbpy

Reply via email to