Thanks everyone, some great tips there.

Cheers
Hamish

On May 2, 1:49 pm, Nick Jenkin <[email protected]> wrote:
> You can use the Pearson correlation score for this. It's a simple
> algorithm that compares two sets of data (in this case words, or word
> counts), and returns a similarity score.
> I've used it before for a similar thing and it was reasonably
> effective. You'll have to google around as the code I have is
> proprietary.
>
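For anyone finding this thread later, here is a minimal sketch of the word-count idea Nick describes. It is not his code (that is proprietary); the function names, the tokenising regex and the choice to compare over the combined vocabulary of both texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word (letters, digits, apostrophes)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def pearson_similarity(text_a, text_b):
    """Pearson correlation of the two texts' word counts.

    Both count vectors are built over the combined vocabulary, so a word
    missing from one text contributes a count of 0 for that text.
    """
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    n = len(vocab)
    if n == 0:
        return 0.0
    xs = [counts_a[w] for w in vocab]
    ys = [counts_b[w] for w in vocab]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one vector is constant
        # (for example, every word occurs exactly once).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


print(pearson_similarity("the cat sat on the mat", "the cat sat on the rug"))
```

Scores close to 1 mean the word frequencies line up; unrelated texts tend to score near or below zero. Word order is ignored entirely, so this flags texts that share vocabulary, not texts that share exact passages.
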
> The MoreLikeThis functionality in Lucene/Solr uses a similar approach
> whereby a list of word frequencies is calculated and documents with a
> similar set of word frequencies are boosted.
> -Nick
>
> On Sat, May 1, 2010 at 9:02 PM, Richard Clark <[email protected]> wrote:
> > http://www.postgresql.org/docs/8.4/static/textsearch.html might help
> > if you're using postgres
>
> > On 1 May 2010 16:49, Hamish Campbell <[email protected]> wrote:
> >> Hey all, this might be a bit OT because I'd prefer to do this on the
> >> database side, but this group always has good ideas and I'd expect the
> >> solution could be at a scripting or database level anyway.
>
> >> I'm looking for a simple method / algorithm to compare the similarity
> >> of two potentially long pieces of text.
>
> >> One strategy I've considered is storing the metaphone of the string
> >> and calculating the Levenshtein distance between them. It seems this
> >> would give quite a good 'fuzzy' match if the strings are of a similar
> >> length, but I'd also like to flag cases where one string might be a
> >> very close match to a piece of a larger string. Would making the
> >> 'deletion' and 'insertion' costs low help in this regard?
>
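On the question of lowering the insertion and deletion costs, a rough sketch for experimenting is below. It is plain Python rather than PHP's levenshtein(), and every function name in it is made up for illustration. The second function is a swapped-in technique that nobody in the thread proposed: the usual approximate-substring variant of the same dynamic programme, where a match may start and end anywhere in the longer text at no cost.

```python
def weighted_levenshtein(a, b, ins_cost=1, sub_cost=1, del_cost=1):
    """Cost of turning string a into string b with per-operation costs."""
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,       # delete ch_a
                curr[j - 1] + ins_cost,   # insert ch_b
                prev[j - 1] + (0 if ch_a == ch_b else sub_cost),  # keep or replace
            ))
        prev = curr
    return prev[-1]


def best_substring_distance(needle, haystack):
    """Smallest unit-cost edit distance between needle and any substring of haystack."""
    # A row of zeros means the match may begin at any haystack position for free.
    prev = [0] * (len(haystack) + 1)
    for i, ch_n in enumerate(needle, start=1):
        curr = [i]
        for j, ch_h in enumerate(haystack, start=1):
            curr.append(min(
                prev[j] + 1,
                curr[j - 1] + 1,
                prev[j - 1] + (0 if ch_n == ch_h else 1),
            ))
        prev = curr
    # Taking the minimum of the last row lets the match end anywhere for free.
    return min(prev)


haystack = "the quick brown fox jumps over the lazy dog"
print(best_substring_distance("quick brwn fox", haystack))   # one missing letter
print(weighted_levenshtein("tqbf", haystack, ins_cost=0))    # free insertions
```

With insertions free, the short string "tqbf" costs nothing to turn into the whole sentence, because its letters merely appear in that order somewhere in it. So very low insertion costs do let a fragment match a larger text, but they also let scattered letters match. The substring variant keeps all costs at 1 and only makes the position of the match free, which seems closer to the "piece of a larger string" case. Both functions are quadratic in the string lengths, so long texts would probably need chunking first.
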
> >> Relevant functions:
> >> http://nz.php.net/manual/en/function.metaphone.php
> >> http://nz.php.net/manual/en/function.levenshtein.php
>
> >> Or is this something Sphinx can be configured to do?
>
> >> I'm trying to achieve something similar to plagiarism detection
> >> services like turnitin.com do (although that's not why I'm doing it)
> >> where matches are more likely to be very close, so it doesn't have to
> >> be that complicated.
>
> >> Any other good ideas?
>
> >> --
> >> NZ PHP Users Group: http://groups.google.com/group/nzphpug
> >> To post, send email to [email protected]
> >> To unsubscribe, send email to
> >> [email protected]

