I use Solr (http://lucene.apache.org/solr) to do something similar for one of my projects. Solr is a fully fledged indexing system, but it also has 'more like this' functionality: you basically ask Solr "which records are similar to record x", and it can give you a similarity score for each match. I've found this works very well when it is run against a large text field. I'm not sure how it calculates how similar records are, but I'm always impressed by how accurate its guesses are. It might be worth evaluating.
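
For anyone who wants to try it, here is a rough sketch of what a 'more like this' request could look like. It only builds the URL, so it runs without a Solr server. The host, the core name "articles", the field name "body" and the record id are all made up for illustration, and it assumes a MoreLikeThis request handler has been configured at /mlt, which depends on your solrconfig.xml.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, text_field, rows=5):
    """Build a MoreLikeThis query URL (hypothetical core and field names)."""
    params = {
        "q": f"id:{doc_id}",    # the source record, i.e. "record x"
        "mlt.fl": text_field,   # field(s) Solr mines for interesting terms
        "mlt.mintf": 1,         # minimum term frequency in the source record
        "mlt.mindf": 1,         # minimum number of documents containing a term
        "fl": "id,score",       # return each match's id and similarity score
        "rows": rows,           # how many similar records to return
        "wt": "json",           # ask for a JSON response
    }
    return f"{base_url}/mlt?{urlencode(params)}"


# Assumed local install with a core called "articles"
url = build_mlt_url("http://localhost:8983/solr/articles", "42", "body")
print(url)
```

Fetching that URL with any HTTP client should return the similar records with their scores, provided the handler is set up and the field is indexed in a way the handler can use.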

Aaron

Hamish Campbell wrote:
Hey all, this might be a bit OT because I'd prefer to do this on the
database side, but this group always has good ideas and I'd expect the
solution could be at a scripting or database level anyway.

I'm looking for a simple method / algorithm to compare the similarity
of two potentially long pieces of text.

One strategy I've considered is storing the metaphone of the string
and calculating the Levenshtein distance between them. It seems this
would give quite a good 'fuzzy' match if the strings are of a similar
length, but I'd also like to flag cases where one string might be a
very close match to a piece of a larger string. Would making the
'deletion' and 'insertion' costs low help in this regard?
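
To make the cost question concrete, here is a minimal Python sketch (not PHP, and not tested against PHP's levenshtein()) of two variants. The first takes insertion, replacement and deletion costs in the same order as PHP's levenshtein(). The second is a swapped-in technique, approximate substring matching, where the match may start and end anywhere in the longer text. Both function names are made up for illustration.

```python
def weighted_levenshtein(a, b, ins=1, rep=1, dele=1):
    """Cost of turning a into b with per-operation costs."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * dele]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + dele,                          # delete ca
                cur[j - 1] + ins,                        # insert cb
                prev[j - 1] + (0 if ca == cb else rep),  # match or replace
            ))
        prev = cur
    return prev[-1]


def best_substring_distance(pattern, text):
    """Smallest unit-cost edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)      # row of zeros: a match may start anywhere
    for i, cp in enumerate(pattern, 1):
        cur = [i]
        for j, ct in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,
                cur[j - 1] + 1,
                prev[j - 1] + (0 if cp == ct else 1),
            ))
        prev = cur
    return min(prev)                  # minimum over the row: may end anywhere


# Free insertions cannot tell a contiguous match from a scattered one
print(weighted_levenshtein("abc", "xxabcxx", ins=0))  # 0
print(weighted_levenshtein("abc", "a-b-c", ins=0))    # 0

# The substring variant only forgives text before and after the match
print(best_substring_distance("quick brwn", "the quick brown fox"))  # 1
```

As the first two calls suggest, lowering the insertion cost also rewards strings whose characters are merely scattered through the longer text, so the substring variant may be closer to what is wanted here. Both are quadratic in the string lengths, which could matter for long texts.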

Relevant functions:
http://nz.php.net/manual/en/function.metaphone.php
http://nz.php.net/manual/en/function.levenshtein.php

Or is this something Sphinx can be configured to do?

I'm trying to achieve something similar to plagiarism detection
services like turnitin.com do (although that's not why I'm doing it)
where matches are more likely to be very close, so it doesn't have to
be that complicated.

Any other good ideas?

--
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to
[email protected]
