[Assp-test] How to avoid multiple duplicates in corpus

Grayhat Mon, 05 Mar 2012 02:55:41 -0800

I was reading a document describing so-called "Locality Sensitive
Hashing" (LSH):


http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf

http://en.wikipedia.org/wiki/Locality-sensitive_hashing

and, while going through it, I started thinking about the ASSP spam and
notspam corpus. As was discussed in the past, a "flood" of similar spam
or ham messages may unbalance the corpus. LSH may offer a solution: the
rebuild could simply "skip" messages that are repeated too often, which
would keep the corpus balanced.
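
As a rough illustration of the "skip if repeated too often" idea, here is a
minimal sketch in Python (used only for illustration, this is not ASSP code).
It assumes MinHash signatures with banding, which is one common LSH scheme
for text; the linked material covers it, but nothing above says which scheme
would be used. The names shingles, minhash, band_keys and filter_corpus, and
the values NUM_HASHES, BANDS and MAX_PER_BUCKET, are all hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32      # signature length (assumed value)
BANDS = 8            # 8 bands of 4 rows each (assumed value)
MAX_PER_BUCKET = 3   # hypothetical cap on near-duplicates kept per bucket


def shingles(text, k=3):
    """Return the set of k-word shingles of a message."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _h(seed, s):
    """Deterministic 64-bit hash of a shingle for a given seed."""
    digest = hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text):
    """MinHash signature: the minimum hash value for each seeded hash."""
    sh = shingles(text)
    return tuple(min(_h(seed, s) for s in sh) for seed in range(NUM_HASHES))


def band_keys(sig):
    """Split a signature into bands; each band is one LSH bucket key."""
    rows = NUM_HASHES // BANDS
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def filter_corpus(messages, max_per_bucket=MAX_PER_BUCKET):
    """Keep a message only if none of its buckets is already full."""
    counts = defaultdict(int)
    kept = []
    for msg in messages:
        keys = band_keys(minhash(msg))
        if any(counts[k] >= max_per_bucket for k in keys):
            continue  # too many similar messages already kept: skip it
        for k in keys:
            counts[k] += 1
        kept.append(msg)
    return kept
```

With these assumed settings, feeding the same message ten times keeps only
the first three copies, while an unrelated message is kept as usual. How
"similar" two messages must be to share a bucket depends on the number of
bands and rows, so those values would need tuning on a real corpus.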

Not just that: while writing this I also thought of another possible
use for LSH. Say we receive an email which, once processed by LSH, falls
into a given "bucket", and say that the email was recognized as "spam"
(or ham, whatever). A second incoming mail hitting the same "LSH bucket"
as the first one then has a fairly high probability of being "spam" (or
ham) too, so the approach may also be used to help classify messages!
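
The bucket-based classification idea could be sketched roughly as follows,
again in Python and again assuming MinHash with banding as the LSH scheme.
The BucketIndex class, its add/vote/guess methods and the parameter values
are hypothetical names made up for this example, not part of ASSP. The
result is meant as a hint for a classifier, not as a verdict on its own.

```python
import hashlib
from collections import Counter, defaultdict

NUM_HASHES = 32   # signature length (assumed value)
BANDS = 8         # 8 bands of 4 rows each (assumed value)


def shingles(text, k=3):
    """Return the set of k-word shingles of a message."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text):
    """MinHash signature built from seeded, deterministic MD5 hashes."""
    sh = shingles(text)

    def h(seed, s):
        digest = hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return tuple(min(h(seed, s) for s in sh) for seed in range(NUM_HASHES))


def band_keys(sig):
    """Split a signature into bands; each band is one LSH bucket key."""
    rows = NUM_HASHES // BANDS
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]


class BucketIndex:
    """Remembers which labels were seen in each LSH bucket."""

    def __init__(self):
        self.buckets = defaultdict(Counter)

    def add(self, text, label):
        """Record an already classified message under all its buckets."""
        for key in band_keys(minhash(text)):
            self.buckets[key][label] += 1

    def vote(self, text):
        """Count the labels stored in every bucket the new message hits."""
        votes = Counter()
        for key in band_keys(minhash(text)):
            if key in self.buckets:
                votes.update(self.buckets[key])
        return votes

    def guess(self, text):
        """Most common label among matching buckets, or None if no hit."""
        votes = self.vote(text)
        if not votes:
            return None
        return votes.most_common(1)[0][0]
```

A message that hits no known bucket yields None, so the normal analysis
would still have to decide. A message close to a known one is likely, but
not guaranteed, to share at least one bucket with it.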
