[phpug] Text comparison - long strings and 'likeness'

Hamish Campbell Fri, 30 Apr 2010 21:49:10 -0700

Hey all, this might be a bit OT because I'd prefer to do this on the
database side, but this group always has good ideas and I'd expect the
solution could be at a scripting or database level anyway.


I'm looking for a simple method / algorithm to compare the similarity
of two potentially long pieces of text.

One strategy I've considered is storing the metaphone of the string
and calculating the Levenshtein distance between them. It seems this
would give quite a good 'fuzzy' match if the strings are of a similar
length, but I'd also like to flag cases where one string might be a
very close match to a piece of a larger string. Would making the
'deletion' and 'insertion' costs low help in this regard?
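
For illustration, a minimal sketch of the weighted-cost idea in PHP.
The helper names are hypothetical; only metaphone() and levenshtein()
are built-ins. It assumes insertions are made free so that extra text
around a match is not penalised, and it is not a tested solution.

```php
<?php

// Hypothetical helper: cost of turning $needle into $haystack when
// insertions are free, so text surrounding a match costs nothing.
// Note: before PHP 8.0, levenshtein() returns -1 if either argument
// is longer than 255 characters, so long texts need chunking first.
function containmentCost(string $needle, string $haystack): int
{
    // Cost arguments are, in order: insertion, replacement, deletion.
    return levenshtein($needle, $haystack, 0, 1, 1);
}

// Hypothetical helper: the same comparison on the metaphone keys.
function phoneticContainmentCost(string $needle, string $haystack): int
{
    return containmentCost(metaphone($needle), metaphone($haystack));
}

var_dump(containmentCost('abc', 'xxabcxx')); // int(0): contiguous match
var_dump(containmentCost('abc', 'a--b--c')); // int(0): scattered letters too
var_dump(containmentCost('abc', 'xyz'));     // int(3): nothing in common
```

The second call shows the catch: with free insertions the needle only
has to appear as a scattered subsequence, not as a contiguous piece,
so a long enough text will "match" almost any short string.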

Relevant functions:
http://nz.php.net/manual/en/function.metaphone.php
http://nz.php.net/manual/en/function.levenshtein.php

Or is this something Sphinx can be configured to do?

I'm trying to achieve something similar to what plagiarism detection
services like turnitin.com do (although that's not why I'm doing it),
but here matches are likely to be very close, so it doesn't have to
be that complicated.
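
One simple possibility for near-verbatim matches, sketched here as an
assumption rather than a recommendation, is to compare overlapping
word sequences ("shingles") and measure what fraction of the shorter
text's shingles also occur in the longer one. The function names and
the shingle size of 3 are hypothetical choices, and the tokeniser
below only handles ASCII letters and digits.

```php
<?php

// Hypothetical helper: the set of overlapping $size-word sequences
// in $text, stored as array keys so lookups are cheap.
function shingles(string $text, int $size = 3): array
{
    // Lower-case and split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $set = [];
    for ($i = 0; $i + $size <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $size))] = true;
    }
    return $set;
}

// Fraction (0.0 to 1.0) of the shingles of $small that also occur in $large.
function containment(string $small, string $large, int $size = 3): float
{
    $a = shingles($small, $size);
    if (count($a) === 0) {
        return 0.0; // $small has fewer than $size words
    }
    $b = shingles($large, $size);
    return count(array_intersect_key($a, $b)) / count($a);
}

var_dump(containment('the quick brown fox',
                     'He said the quick brown fox jumps over'));
```

A score near 1.0 suggests the short text appears almost verbatim
inside the long one, regardless of how much longer the latter is.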

Any other good ideas?

-- 
NZ PHP Users Group: http://groups.google.com/group/nzphpug
