I have successfully used a computer security program called ssdeep for this purpose. It computes fuzzy hashes, so similar files produce similar hashes that can be compared to get a match score. I use it to check whether an identical or similar file is already in the document bank when a user uploads a new file to the system. It works very well. Unfortunately, I cannot release my PHP wrapper class as it is the property of my employer.
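The upload check described above can be sketched roughly as follows. This is not the wrapper class mentioned above and it does not call ssdeep: Python's standard-library difflib.SequenceMatcher is swapped in as a stand-in scorer so the example runs without extra packages. The names similarity, find_similar and document_bank, and the threshold of 80, are all hypothetical. A real setup would presumably store one fuzzy hash per document and compare hashes rather than full texts.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in score from 0 (nothing in common) to 100 (identical).

    Assumption: this uses difflib, NOT ssdeep's fuzzy-hash comparison.
    """
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def find_similar(new_text: str, document_bank: dict, threshold: int = 80) -> list:
    """Return (name, score) pairs for stored documents scoring >= threshold,
    best match first."""
    matches = []
    for name, stored_text in document_bank.items():
        score = similarity(new_text, stored_text)
        if score >= threshold:
            matches.append((name, score))
    return sorted(matches, key=lambda match: match[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical document bank keyed by file name.
    bank = {
        "report-v1.txt": "The quick brown fox jumps over the lazy dog.",
        "notes.txt": "Completely unrelated text about databases.",
    }
    upload = "The quick brown fox jumped over the lazy dog."
    for name, score in find_similar(upload, bank):
        print(f"{name}: {score}")
```

Comparing full texts like this gets slow as the bank grows, which is one reason to compare precomputed hashes instead.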
Hope that helps.

-original message-
Subject: [phpug] Text comparison - long strings and 'likeness'
From: Hamish Campbell <[email protected]>
Date: 01.05.2010 05:49

Hey all, this might be a bit OT because I'd prefer to do this on the database side, but this group always has good ideas and I'd expect the solution could be at a scripting or database level anyway.

I'm looking for a simple method / algorithm to compare the similarity of two potentially long pieces of text.

One strategy I've considered is storing the metaphone of each string and calculating the Levenshtein distance between them. It seems this would give quite a good 'fuzzy' match if the strings are of a similar length, but I'd also like to flag cases where one string might be a very close match to a piece of a larger string. Would making the 'deletion' and 'insertion' costs low help in this regard?

Relevant functions:
http://nz.php.net/manual/en/function.metaphone.php
http://nz.php.net/manual/en/function.levenshtein.php

Or is this something Sphinx can be configured to do?

I'm trying to achieve something similar to what plagiarism detection services like turnitin.com do (although that's not why I'm doing it) where matches are more likely to be very close, so it doesn't have to be that complicated.

Any other good ideas?

--
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to [email protected]
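On the question above about lowering the 'insertion' and 'deletion' costs: the effect can be tried out with a small weighted edit distance. The sketch below is a plain dynamic-programming Levenshtein in Python with per-operation costs, similar in spirit to the cost arguments that PHP's levenshtein() accepts. The function name weighted_levenshtein and its parameter names are hypothetical, and no metaphone step is included.

```python
def weighted_levenshtein(a: str, b: str,
                         insert_cost: int = 1,
                         substitute_cost: int = 1,
                         delete_cost: int = 1) -> int:
    """Cheapest cost of turning string a into string b."""
    # Row for the empty prefix of a: build each prefix of b by insertions.
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        cur = [i * delete_cost]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + delete_cost,       # drop char_a
                cur[j - 1] + insert_cost,    # add char_b
                prev[j - 1] + (0 if char_a == char_b else substitute_cost),
            ))
        prev = cur
    return prev[-1]


fragment = "quick brown fox"
document = "the quick brown fox jumps over the lazy dog"

print(weighted_levenshtein(fragment, document))                  # 28
print(weighted_levenshtein(fragment, document, insert_cost=0))   # 0
```

With unit costs the fragment looks far from the larger text (28 insertions), while free insertions bring the distance to 0. The catch is that very cheap insertions also give a low score to any string whose characters merely appear in the same order somewhere in the larger text, so scattered matches get flagged as well as contiguous ones.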
