The PostgreSQL full text search documentation might help if you're using Postgres:
http://www.postgresql.org/docs/8.4/static/textsearch.html

On 1 May 2010 16:49, Hamish Campbell <[email protected]> wrote:
> Hey all, this might be bit OT because I'd prefer to do this on the
> database side, but this group always has good ideas and I'd expect the
> solution could be at a scripting or database level anyway.
>
> I'm looking for a simple method / algorithm to compare the similarity
> of two potentially long pieces of text.
>
> One strategy I've considered is storing the metaphone of the string
> and calculating the Levenshtein distance between them. It seems this
> would give quite a good 'fuzzy' match if the strings are of a similar
> length, but I'd also like to flag cases where one string might be a
> very close match to a piece of a larger string. Would making the
> 'deletion' and 'insertion' costs low help in this regard?
>
> Relevant functions:
> http://nz.php.net/manual/en/function.metaphone.php
> http://nz.php.net/manual/en/function.levenshtein.php
>
> Or is this something Sphinx can be configured to do?
>
> I'm trying to achieve something similar to plagiarism detection
> services like turnitin.com do (although that's not why I'm doing it)
> where matches are more likely to be very close, so it doesn't have to
> be that complicated.
>
> Any other good ideas?
>
> --
> NZ PHP Users Group: http://groups.google.com/group/nzphpug
> To post, send email to [email protected]
> To unsubscribe, send email to
> [email protected]
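
On the question of whether low 'deletion' and 'insertion' costs help with matching a piece of a larger string, here is a rough sketch in Python (used purely for illustration, not PHP). The function names weighted_levenshtein and best_substring_distance are made up for this example. The cost parameters are meant to mirror the optional insertion/replacement/deletion cost arguments of PHP's levenshtein(). It works on raw strings and leaves out the metaphone step. The second function is a standard variation usually called approximate substring matching, offered as one possible alternative rather than as the answer.

```python
def weighted_levenshtein(source, target, insert_cost=1, replace_cost=1, delete_cost=1):
    """Cost of turning `source` into `target` with per-operation costs."""
    # Row 0: building each prefix of `target` from an empty string.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        # Column 0: deleting the first i characters of `source`.
        current = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else replace_cost)
            current.append(min(
                previous[j] + delete_cost,     # drop s_char
                current[j - 1] + insert_cost,  # add t_char
                substitution,                  # keep or replace
            ))
        previous = current
    return previous[-1]


def best_substring_distance(pattern, text):
    """Smallest unit-cost edit distance between `pattern` and any substring of `text`."""
    # Row 0 is all zeros: a match may start anywhere in `text` at no cost.
    previous = [0] * (len(text) + 1)
    for i, p_char in enumerate(pattern, start=1):
        current = [i]  # i pattern characters matched against nothing
        for j, t_char in enumerate(text, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (p_char != t_char),
            ))
        previous = current
    # The match may also end anywhere in `text`.
    return min(previous)
```

With insert_cost=0, weighted_levenshtein("brown", "the quick brown fox") drops to 0, so the short string is indeed flagged. The catch is that "bwn" also scores 0 against the same text, because free insertions make any in-order scattering of the letters a perfect match. best_substring_distance keeps all costs at 1 but lets the match start and end anywhere in the longer text, so "brwn" scores 1 against "the quick brown fox" while unrelated text is still penalised. Both are O(n*m) in the string lengths, which may matter for long documents.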
