I was reading some material describing the so-called "Locality Sensitive Hashing" (LSH):

http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

While going through it, I started thinking about the ASSP spam and notspam corpus. As was discussed in the past, a "flood" of similar spam or ham messages may unbalance the corpus. LSH may offer a solution: during the rebuild, messages that land in the same LSH bucket too often could simply be skipped, which would help keep the corpus balanced.

While writing this I also thought of another possible use. Say we receive an email that, once processed by LSH, lands in a given "bucket", and that this email was recognized as spam (or ham, whatever). A second incoming mail hitting the same LSH bucket as the first has a fairly high probability of being spam (or ham) too, so the approach might also help classify messages!
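To make both ideas a bit more concrete, here is a rough sketch in Python. It is not ASSP code (ASSP is written in Perl) and all names in it (`LshIndex`, `minhash`, `max_per_bucket`, and so on) are made up for illustration. It assumes one common LSH construction, MinHash signatures over word shingles split into bands, and the parameter values are arbitrary guesses, not tuned settings. `add()` shows the "skip if repeated too often" rule for the rebuild, and `lookup()` shows how labels already seen in the same buckets could serve as a hint for a new message.

```python
import hashlib

NUM_HASHES = 32              # length of the MinHash signature (assumed value)
BANDS = 8                    # signature is cut into 8 bands ...
ROWS = NUM_HASHES // BANDS   # ... of 4 rows each; every band is one bucket key
SHINGLE_WORDS = 3            # words per shingle (assumed value)


def shingles(text, k=SHINGLE_WORDS):
    """Return the set of k-word shingles of a message body."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash value per hash function."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def bucket_keys(signature):
    """One bucket key per band; similar messages tend to share some of them."""
    return [(b, signature[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


class LshIndex:
    """Hypothetical helper: counts labelled messages per LSH bucket."""

    def __init__(self, max_per_bucket=3):
        self.max_per_bucket = max_per_bucket
        self.counts = {}  # bucket key -> {label: number of stored messages}

    def lookup(self, text):
        """Per label, the largest count found in any bucket the message hits."""
        votes = {}
        for key in bucket_keys(minhash(text)):
            for label, n in self.counts.get(key, {}).items():
                votes[label] = max(votes.get(label, 0), n)
        return votes

    def add(self, text, label):
        """Store a message; return False (skip it) if any of its buckets
        already holds max_per_bucket messages with the same label."""
        keys = bucket_keys(minhash(text))
        if any(self.counts.get(k, {}).get(label, 0) >= self.max_per_bucket
               for k in keys):
            return False
        for k in keys:
            per_label = self.counts.setdefault(k, {})
            per_label[label] = per_label.get(label, 0) + 1
        return True
```

During a rebuild one would call `add(body, "spam")` or `add(body, "ham")` for every corpus file and leave out the ones for which it returns False. For an incoming mail, a non-empty result from `lookup(body)` would only be a hint to combine with the existing checks, since near-duplicates collide in a bucket with some probability rather than with certainty.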
