I have successfully used ssdeep, a fuzzy-hashing program from the computer
security field, for this purpose. When a user uploads a new file to the system,
I use it to check whether an identical or similar file is already in the
document bank. It works very well. Unfortunately, I cannot release my PHP
wrapper class, as it is the property of my employer.

Hope that helps.

-original message-
Subject: [phpug] Text comparison - long strings and 'likeness'
From: Hamish Campbell <[email protected]>
Date: 01.05.2010 05:49

Hey all, this might be a bit OT because I'd prefer to do this on the
database side, but this group always has good ideas and I'd expect the
solution could be at a scripting or database level anyway.

I'm looking for a simple method / algorithm to compare the similarity
of two potentially long pieces of text.

One strategy I've considered is storing the metaphone of the string
and calculating the Levenshtein distance between them. It seems this
would give quite a good 'fuzzy' match if the strings are of a similar
length, but I'd also like to flag cases where one string might be a
very close match to a piece of a larger string. Would making the
'deletion' and 'insertion' costs low help in this regard?
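
To make the cost question concrete, here is a minimal sketch of a weighted
edit distance, written in Python purely for illustration. The function name
and the sample strings are made up for this example; PHP's levenshtein()
accepts insertion, replacement and deletion costs as optional arguments, so
the same experiment should be possible there. The sketch measures the cost of
turning the short string into the long one, so it is the insertion cost that
matters in this direction.

```python
def weighted_levenshtein(source, target, ins_cost=1, sub_cost=1, del_cost=1):
    """Cost of editing `source` into `target` with per-operation weights."""
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # curr[0] = cost of deleting the first i characters of source
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,        # drop s_ch
                curr[j - 1] + ins_cost,    # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),  # match or replace
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample data: a fragment and a longer text that contains it.
fragment = "quick brown fox"
document = "the quick brown fox jumps over the lazy dog"

uniform = weighted_levenshtein(fragment, document)
free_inserts = weighted_levenshtein(fragment, document, ins_cost=0)
print(uniform, free_inserts)
```

With uniform costs the distance is dominated by the difference in length, so
a fragment contained in a long text still looks like a poor match. With free
insertions the contained fragment scores zero. One caveat with this approach:
free insertions only require the characters to appear in the same order, not
next to each other, so a fragment whose letters are scattered across the
longer text would also score zero.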

Relevant functions:
http://nz.php.net/manual/en/function.metaphone.php
http://nz.php.net/manual/en/function.levenshtein.php

Or is this something Sphinx can be configured to do?

I'm trying to achieve something similar to what plagiarism detection
services like turnitin.com do (although that's not why I'm doing it).
In my case matches are more likely to be very close, so it doesn't
have to be that complicated.

Any other good ideas?

-- 
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to
[email protected]