Thanks to everyone for the suggestions, especially Tim for taking the time to write up what I think is a really good idea. I'm going to take a crack at this, and if it works out and I can get the algorithm tuned, we'll release it as a gem (or possibly a plugin).
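For anyone following along, here is a rough, untuned sketch of the fragment idea Tim describes in the quoted message below: build one-, two-, and three-word fragments for each text, subtract one array from the other, and see what is left over. The method names (`words`, `fragments`, `similarity`) and the way the leftovers are turned into a 0-100 score are assumptions made for illustration, not anything Tim specified, and the weighting would almost certainly need tinkering.

```ruby
# Split text into lowercase words, ignoring punctuation.
# (Hypothetical helper; the tokenizing rule is an assumption.)
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Build every one-, two-, and three-word fragment of the text.
def fragments(text, sizes = [1, 2, 3])
  w = words(text)
  sizes.flat_map { |n| w.each_cons(n).map { |chunk| chunk.join(' ') } }
end

# Score from 0 to 100. Fragments are subtracted in both directions;
# a perfect match leaves two empty arrays and scores 100.
# The scoring formula is an assumption, not part of the original proposal.
def similarity(a, b)
  fa = fragments(a)
  fb = fragments(b)
  return 0 if fa.empty? || fb.empty?

  leftover = (fa - fb).size + (fb - fa).size
  total = fa.size + fb.size
  (100.0 * (total - leftover) / total).round
end

puts similarity("the quick brown fox", "the quick red fox")
```

Single words carry the content comparison, while the two- and three-word fragments reward shared word order, so swapping one word in a short sentence costs more than one fragment. Whether that penalty is too harsh for 50 to 300 character texts is exactly the sort of thing that would need tuning.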
Once again, thanks for all the suggestions; it's what makes this list great.

Dale

On Jun 20, 10:43 pm, timr <[email protected]> wrote:
> Hi Dale,
> It is a good Ruby question (and Rails is a Ruby framework, so I think
> it is fair game). I don't know that the Levenshtein suggestion will be
> that helpful. (You can try it with require 'text', since it is part of
> the text gem.) The algorithm is based on the number of changes needed
> to turn one string into a second (deletions, substitutions, and
> additions). It is nice for comparing words for possible misspellings,
> etc. But in your case, if you want to compare content, you need an
> approach that focuses on word frequency and context. Here Levenshtein
> is not the right tool (at least, it doesn't seem so to me).
>
> Endtagger looks like an interesting module; it could be useful.
>
> Here is a third proposal: what about breaking the text into chunks
> (one-word fragments, two-word fragments, three-word fragments) and
> then subtracting one array of fragments from the other (the fragments
> generated from the second sentence)? A perfect match would leave
> an empty array, while less perfect matches would leave more fragments.
> It might take some tinkering to get the algorithm tuned right, but how
> much tinkering depends on how much information you need to get from
> the poorer matches. The single-word fragments compare content; the
> double- and triple-word fragments compare context.
> Good luck,
> Tim
>
> On Jun 19, 5:25 pm, PeteSalty <[email protected]> wrote:
>
> > This isn't really a Rails post, but this group has given such great
> > responses to a range of questions over the years I thought I'd ask
> > anyway.
>
> > I've been tasked with writing a Rails app that takes a block of text,
> > anywhere from about 50 to 300 characters (a sentence or two), and
> > compares it with other similarly sized blocks of text to determine
> > how similar they are, both in content and in context. It doesn't
> > have to be perfect, but it has to be reasonably close. I was
> > thinking that it would be good to get a numerical score depending
> > on how close they are (90 is really close, 20 is not very close at
> > all), but I'm certainly open to ideas.
>
> > Anyway, the problem is I have no idea how to do this or even where to
> > look to get started. I really doubt that there is already a Ruby
> > library to do this (although that would rock), or a Rails plug-in
> > (although that would rock really hard), so I'm looking more for ideas
> > on what I should be reading to get a sense of how to start on this.
> > Anything would help: theoretical ideas, technical papers, Wikipedia
> > articles, anything.
>
> > Anyway, any suggestions are greatly appreciated.
> > Dale

You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en

