Hey all, this might be a bit OT because I'd prefer to do this on the database side, but this group always has good ideas, and I'd expect the solution could be at a scripting or database level anyway.
I'm looking for a simple method / algorithm to compare the similarity of two potentially long pieces of text.

One strategy I've considered is storing the metaphone of each string and calculating the Levenshtein distance between the two metaphone keys. It seems this would give quite a good 'fuzzy' match if the strings are of a similar length, but I'd also like to flag cases where one string is a very close match to a piece of a larger string. Would making the 'deletion' and 'insertion' costs low help in this regard?

Relevant functions:
http://nz.php.net/manual/en/function.metaphone.php
http://nz.php.net/manual/en/function.levenshtein.php

Or is this something Sphinx can be configured to do?

I'm trying to achieve something similar to what plagiarism detection services like turnitin.com do (although that's not why I'm doing it). In my case matches are more likely to be very close, so it doesn't have to be that complicated.

Any other good ideas?

--
NZ PHP Users Group: http://groups.google.com/group/nzphpug
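
P.S. To make the "close match to a piece of a larger string" case concrete, here is a rough sketch of one possible approach. It is in Python rather than PHP purely to keep it short, and the names substring_distance and substring_similarity are made up for illustration; they are not PHP or Sphinx functions. Instead of lowering the insertion/deletion costs, it keeps every cost at 1 but lets the match start and end anywhere in the longer text for free (the first row of the table is all zeros, and the answer is the minimum of the last row). Treat it as an illustration under those assumptions, not a tested solution.

```python
def substring_distance(needle: str, haystack: str) -> int:
    """Smallest edit distance between needle and ANY substring of haystack."""
    # prev[j] = cost of aligning the needle prefix processed so far with some
    # substring of haystack that ends at position j. Row 0 is all zeros, so a
    # match may begin anywhere in haystack at no cost.
    prev = [0] * (len(haystack) + 1)
    for i, n_ch in enumerate(needle, start=1):
        # i needle characters against an empty substring costs i deletions.
        curr = [i]
        for j, h_ch in enumerate(haystack, start=1):
            curr.append(min(
                prev[j] + 1,                    # drop a needle character
                curr[j - 1] + 1,                # absorb an extra haystack character
                prev[j - 1] + (n_ch != h_ch),   # exact match (0) or substitution (1)
            ))
        prev = curr
    # The match may also end anywhere, so take the best cell in the final row.
    return min(prev)


def substring_similarity(needle: str, haystack: str) -> float:
    """Score in [0, 1]; 1.0 means needle occurs verbatim inside haystack."""
    if not needle:
        return 1.0
    return 1.0 - substring_distance(needle, haystack) / len(needle)
```

For example, substring_similarity("quick brwn", "the quick brown fox") comes out at about 0.9, because only one character has to be added to find the phrase inside the longer text. The running time is proportional to len(needle) * len(haystack), so for long documents this would presumably need a cheaper pre-filter in front of it. The same table could be built over metaphone keys instead of raw text.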
