Thanks everyone, some great tips there.

Cheers
Hamish
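For anyone who finds this thread later, here is a minimal sketch of the Pearson correlation idea Nick describes below. It is not Nick's code (his is proprietary). It is written in Python rather than PHP purely for illustration, and the function names `word_counts` and `pearson_similarity` are made up for this example. Scoring two identical count vectors as 1.0 is a convention chosen here, since the correlation is strictly undefined when every word has the same count.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word (hypothetical helper)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def pearson_similarity(text_a, text_b):
    """Pearson correlation of word counts over the combined vocabulary."""
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    n = len(vocab)
    if n == 0:
        return 0.0  # nothing to compare

    # A Counter returns 0 for words that are missing from one text.
    xs = [counts_a[w] for w in vocab]
    ys = [counts_b[w] for w in vocab]
    if xs == ys:
        return 1.0  # assumption: identical counts are a perfect match

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog the end"
    near_copy = "the quick brown fox leaps over the lazy dog the end"
    print(pearson_similarity(original, near_copy))
```

Scores run from -1 to 1, and near-copies should land close to 1. Word counts ignore word order, so on its own this may not flag a short passage copied into a much longer text.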

On May 2, 1:49 pm, Nick Jenkin <[email protected]> wrote:
> You can use the Pearson correlation score for this. It's a simple
> algorithm that compares two sets of data (in this case words, or word
> counts), and returns a similarity score.
> I've used it before for a similar thing and it was reasonably
> effective. You'll have to google around as the code I have is
> proprietary.
>
> The MoreLikeThis functionality in Lucene/Solr uses a similar approach
> whereby a list of word frequencies is calculated and documents with a
> similar set of word frequencies are boosted.
> -Nick
>
> On Sat, May 1, 2010 at 9:02 PM, Richard Clark <[email protected]> wrote:
> > http://www.postgresql.org/docs/8.4/static/textsearch.html might help
> > if you're using Postgres
> >
> > On 1 May 2010 16:49, Hamish Campbell <[email protected]> wrote:
> >> Hey all, this might be a bit OT because I'd prefer to do this on the
> >> database side, but this group always has good ideas and I'd expect the
> >> solution could be at a scripting or database level anyway.
> >>
> >> I'm looking for a simple method / algorithm to compare the similarity
> >> of two potentially long pieces of text.
> >>
> >> One strategy I've considered is storing the metaphone of the string
> >> and calculating the Levenshtein distance between them. It seems this
> >> would give quite a good 'fuzzy' match if the strings are of a similar
> >> length, but I'd also like to flag cases where one string might be a
> >> very close match to a piece of a larger string. Would making the
> >> 'deletion' and 'insertion' costs low help in this regard?
> >>
> >> Relevant functions:
> >> http://nz.php.net/manual/en/function.metaphone.php
> >> http://nz.php.net/manual/en/function.levenshtein.php
> >>
> >> Or is this something Sphinx can be configured to do?
> >>
> >> I'm trying to achieve something similar to what plagiarism detection
> >> services like turnitin.com do (although that's not why I'm doing it)
> >> where matches are more likely to be very close, so it doesn't have to
> >> be that complicated.
> >>
> >> Any other good ideas?

--
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to [email protected]
