Thanks to everyone for the suggestions, especially Tim for taking the
time to write up what I think is a really good idea. I'm going to take
a crack at this, and if it works out and I can get the algorithm tuned,
we'll release it as a gem (or possibly a plugin).

Once again, thanks for all the suggestions; they're what makes this
list great.

Dale


On Jun 20, 10:43 pm, timr <[email protected]> wrote:
> Hi Dale,
> It is a good Ruby question (and Rails is a Ruby framework, so I think
> it is fair game). I don't think the Levenshtein suggestion will be
> that helpful. (You can try it with require 'text' once the text gem is
> installed; it is a separate gem, not part of the standard library.)
> The algorithm counts the number of changes needed to turn one string
> into a second (deletions, substitutions, and additions). It is nice
> for comparing words for possible misspellings, etc. But in your case,
> if you want to compare content, you need an approach that focuses on
> word frequency and context. Here Levenshtein is not the right tool (at
> least, it doesn't seem so to me).
>
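To make the edit-distance idea quoted above concrete, here is a minimal
pure-Ruby sketch. It is not the text gem's implementation; the method
name levenshtein is just an illustrative choice, and it compares
character by character, which is why it suits misspellings better than
whole-sentence content.

```ruby
# Sketch only: classic dynamic-programming edit distance between two strings.
# `levenshtein` is a hypothetical helper name, not an API from any gem.
def levenshtein(a, b)
  # previous[j] is the distance between the part of `a` seen so far and b[0, j].
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i] # turning i characters of `a` into an empty string costs i deletions
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # addition
        previous[j - 1] + cost  # substitution (free when the characters match)
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein('kitten', 'sitting')
```

Because every character counts equally, two sentences with the same
words in a different order can still score as far apart, which is the
limitation described above.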
> Endtagger looks like an interesting module--could be useful.
>
> Here is a third proposal: what about breaking each text into chunks
> (one-word fragments, two-word fragments, three-word fragments) and then
> subtracting one array of fragments from the other (the fragments
> generated from the second sentence)? A perfect match would leave an
> empty array, while less perfect matches would leave more fragments.
> It might take some tinkering to get the algorithm tuned right, but how
> much tinkering depends on how much information you need to get from
> the poorer matches. The single-word fragments compare content; the
> double- and triple-word fragments compare context.
> Good luck,
> Tim
>
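A rough sketch of the fragment-subtraction idea described above, under
some assumptions of my own: words are lowercased and punctuation is
dropped, and the 0-100 score is simply the share of the first text's
fragments that also appear in the second. The method names (words,
fragments, leftover, similarity) are hypothetical, and the scoring is
only one of many ways the algorithm could be tuned.

```ruby
# Sketch only: all method names here are illustrative, not from a library.

# Split text into lowercase words, dropping punctuation (an assumption).
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Build every one-, two-, and three-word fragment from a text.
def fragments(text, sizes = [1, 2, 3])
  tokens = words(text)
  sizes.flat_map { |n| tokens.each_cons(n).map { |group| group.join(' ') } }
end

# Fragments of `a` that never occur in `b`. Array#- removes every
# element of the left array that appears anywhere in the right array.
def leftover(a, b)
  fragments(a) - fragments(b)
end

# Assumed scoring: percentage of a's fragments that were also found in b.
def similarity(a, b)
  all = fragments(a)
  return 0 if all.empty?
  (100.0 * (all.size - leftover(a, b).size) / all.size).round
end

p leftover('the cat sat', 'the cat ran')
puts similarity('the cat sat', 'the cat ran')
```

An empty leftover array is the perfect match described above. Note that
this score is not symmetric: similarity(a, b) and similarity(b, a) can
differ when one text is longer, so a tuned version might average the two
or weight the longer fragments more heavily.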
> On Jun 19, 5:25 pm, PeteSalty <[email protected]> wrote:
>
> > This isn't really a Rails post, but this group has given such great
> > responses to a range of questions over the years I thought I'd ask
> > anyway.
>
> > I've been tasked with writing a Rails app that takes a block of text,
> > anywhere from about 50 characters up to 300 characters (about a
> > sentence or two), compares it to other similarly sized blocks of
> > text, and measures how similar they are, content-wise and
> > contextually. It doesn't have to be perfect but it has to be
> > reasonably close. I was thinking that it would be good to be able to
> > get a numerical score depending on how close they were (90 is really
> > close, 20 is not very close at all), but I'm certainly open to ideas.
>
> > Anyway, the problem is I have no idea how to do this or even where to
> > look to get started. I really doubt that there is already a Ruby
> > library to do this (although that would rock), or a Rails plug-in
> > (although that would rock really hard), so I'm more looking for ideas
> > on what I should be reading to get a sense of how to start on this.
> > Anything would help: theoretical ideas, technical papers, Wikipedia
> > articles, anything.
>
> > Anyway, any suggestions are greatly appreciated.
> > Dale
>
>
You received this message because you are subscribed to the Google Groups "Ruby 
on Rails: Talk" group.