First, I really enjoyed last night. I learned a lot of really cool things. If you think what you have to say is of no interest, think again :)
Now, here is a more sophisticated method for determining the similarity between any two given documents. In the script, I am comparing a sample of eBay item titles. The method is taken directly from Section 5.7 of Practical Text Mining With Perl; I just cleaned it up and modified it for my purposes. The result is a square matrix (MxM, given M documents) that relates every "document" to every other; each value is a measure of similarity ranging from 1 (exact) to 0.

https://github.com/estrabd/lightning-talks/tree/master/houston-pm-13-nov-2014-text-mining

I forgot to mention last night that the method uses what is called a "bag of words" model, meaning that word order doesn't matter. Word order may be taken into account using "n-grams", or strings of ordered words, and I imagine the same method may apply; it just greatly increases the number of entries in each document vector.

There's a lot to this book, so maybe I'll have something interesting the next time we do another round of these talks.

Brett
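
For anyone who wants to see the bag-of-words idea in a few lines, here is a minimal sketch. It is written in Python and is not the Perl script from the repository. It assumes cosine similarity over raw word counts, which may differ from the weighting used in Section 5.7 of the book. The function names and the sample titles are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_ngrams(text, n=1):
    """Count the n-grams in a text; with n=1 this is a plain bag of words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (assumed similarity measure)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with anything
    return dot / (norm_a * norm_b)


def similarity_matrix(documents, n=1):
    """Build the MxM matrix relating every document to every other."""
    vectors = [bag_of_ngrams(d, n) for d in documents]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical item titles, not taken from the real sample
titles = [
    "vintage canon camera lens",
    "lens canon vintage camera",
    "mens leather wallet brown",
]

matrix = similarity_matrix(titles)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

With single words (n=1) the first two titles are treated as the same document, since they contain the same words in a different order. Calling similarity_matrix(titles, n=2) compares ordered word pairs instead, and those two titles no longer share any entries.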
_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
