[pm-h] document similarity demo script related to lightning talk

B. Estrade via Houston Fri, 14 Nov 2014 14:04:15 -0800

First, I really enjoyed last night. I learned a lot of really cool things.
If you think what you have to say is of no interest, think again :)


Now, here is a more sophisticated method for determining the similarity
between any 2 given documents.  In the case of the script, I am comparing a
sampling of eBay item titles. It is taken directly out of Section 5.7 of
Practical Text Mining With Perl. I just cleaned it up and modified it for
my purposes.

The result is a square matrix (MxM, given M documents) that relates every
"document" to every other. Each entry is a measure of similarity ranging
from 1 (exact) to 0.
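
For anyone who wants to see the shape of that matrix without pulling the
repo, here is a minimal sketch in Python. It is not the Perl script from the
book or from the link below. It assumes cosine similarity over raw word
counts, which may differ from the weighting the script actually uses, and
the sample titles and function names are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Lowercase and split on whitespace; word order is discarded.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine of the angle between two word-count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    # M documents in, M x M list of lists out.
    vectors = [bag_of_words(d) for d in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical titles, not taken from the real data set.
titles = [
    "vintage canon camera lens",
    "canon camera lens vintage",
    "leather wallet brown",
]

matrix = similarity_matrix(titles)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

The first two titles contain the same words in a different order, so they
score as identical under this model, while the third shares no words with
either and scores 0.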

https://github.com/estrabd/lightning-talks/tree/master/houston-pm-13-nov-2014-text-mining

I forgot to mention last night that the method uses what is called a "bag
of words" model - meaning that word order doesn't matter.  Word order may
be considered using "n-grams" - or strings of ordered words, and I imagine
the same method may apply - it just greatly increases the number of
entries in each document vector.
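
To make the n-gram idea concrete, here is a small Python sketch (again, not
the Perl script, and the function name and sample titles are hypothetical).
It builds a document vector keyed on runs of n consecutive words instead of
single words, so two titles with the same words in a different order no
longer look identical.

```python
from collections import Counter


def ngram_bag(text, n=2):
    # Count each run of n consecutive words; n=1 is the plain bag of words.
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical titles with identical words in a different order.
first = "vintage canon camera lens"
second = "canon camera lens vintage"

unigrams_match = ngram_bag(first, n=1) == ngram_bag(second, n=1)
bigrams_match = ngram_bag(first, n=2) == ngram_bag(second, n=2)
shared_bigrams = set(ngram_bag(first)) & set(ngram_bag(second))

print(unigrams_match, bigrams_match, len(shared_bigrams))
```

The vectors get larger because the number of distinct word pairs in a
collection is usually far greater than the number of distinct words.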

There's a lot to this book, so maybe I'll have something interesting the
next time we do another round of these talks.

Brett

_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
