The rebuild has ignored identical messages in 'spam' and 'notspam' for years now.

Similar mails are detected 100% by HMM, as long as some of them were
detected, or one was reported, at an earlier time. It is also impossible for
similar mails (even thousands of them) to compromise the corpusnorm. And
the BayesWeight and HMM result are limited to 1 even if one message (or
a similar one) is received multiple times.
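
As a rough illustration of the "count each message only once" idea, here is a
minimal Python sketch. It is not ASSP code (ASSP is written in Perl); the
function names and the whitespace/lowercase normalization are assumptions made
only for this example, and it handles exact repeats only, not similar mails.

```python
import hashlib

def normalize(text):
    # Assumption: lowercase and collapse whitespace so trivial
    # differences do not make two copies look distinct.
    return " ".join(text.lower().split())

def unique_messages(messages):
    """Yield each distinct message once, in first-seen order."""
    seen = set()
    for msg in messages:
        digest = hashlib.sha256(normalize(msg).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield msg

corpus = ["Buy now!", "buy   NOW!", "Meeting at 10", "Buy now!"]
deduped = list(unique_messages(corpus))
```

With this, a message that arrives a thousand times still contributes a single
entry to whatever is trained afterwards.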

To keep your corpus up to date, run the rebuild more than once a day.

LSH is not needed in ASSP - or rather, IMHO, it is already implemented in a
much better way (HMM level 5).

Thomas




From:    Grayhat <[email protected]>
To:      [email protected]
Cc:      MVPS Admins <[email protected]>
Date:    05 Mar 2012 11:55
Subject: [Assp-test] How to avoid multiple duplicates in corpus




I was reading a document describing so-called "Locality-Sensitive
Hashing" (LSH):

http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf

http://en.wikipedia.org/wiki/Locality-sensitive_hashing

While going through it, I started thinking about the ASSP spam and
notspam corpus. As was discussed in the past, it's possible that a
"flood" of similar spam or ham messages may somewhat unbalance the
corpus. LSH may offer a solution here: the rebuild could simply "skip"
messages if they're repeated too often, which would keep the corpus
balanced.
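
To make the "skip messages repeated too often" idea concrete, here is a small
Python sketch using MinHash signatures with LSH banding. Everything in it is an
assumption for illustration: the names (rebuild_filter, max_per_bucket), the
3-word shingles and the 32-hash / 8-band layout are not taken from ASSP or from
the slides above.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32               # length of the MinHash signature
BANDS = 8                     # signature is split into 8 bands
ROWS = NUM_HASHES // BANDS    # 4 rows per band

def shingles(text, k=3):
    """Set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _h(seed, s):
    # Deterministic seeded 64-bit hash (Python's hash() is salted per run).
    data = f"{seed}:{s}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash(text):
    sh = shingles(text)
    return [min(_h(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def band_keys(signature):
    """One bucket key per band; similar texts tend to share some key."""
    return [(b, tuple(signature[b * ROWS:(b + 1) * ROWS]))
            for b in range(BANDS)]

def rebuild_filter(messages, max_per_bucket=2):
    """Keep a message unless one of its buckets is already full."""
    counts = defaultdict(int)
    kept = []
    for msg in messages:
        keys = band_keys(minhash(msg))
        if any(counts[k] >= max_per_bucket for k in keys):
            continue          # repeated too often: skip it
        for k in keys:
            counts[k] += 1
        kept.append(msg)
    return kept

flood = ["cheap pills buy now limited offer click here today"] * 5
other = "minutes of the project meeting are attached for review"
kept = rebuild_filter(flood + [other])
```

Here only two copies of the flooded message survive, next to the unrelated
one. For near-duplicates the match is probabilistic, so the band layout would
need tuning against real mail.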

Not just that: while writing this I was also thinking about another
possible use for it. Let's say we receive a given email which, once
processed by LSH (see above), falls into a given "bucket", and let's
also say that the email was recognized as "spam" (or ham, whatever).
A second incoming mail hitting the same "LSH bucket" as the first
one then has a quite high probability of being "spam" (or ham) too, so
the approach may also be used to help classify messages!
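
A possible shape for this second idea, again only as a hedged Python sketch:
known messages are stored in LSH buckets together with their label, and a new
message takes the majority label of the buckets it hits. The BucketIndex class
and its add/guess methods are hypothetical names, and the MinHash parameters
are arbitrary choices for the example.

```python
import hashlib
from collections import Counter, defaultdict

NUM_HASHES, BANDS = 32, 8
ROWS = NUM_HASHES // BANDS

def shingles(text, k=3):
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text):
    sh = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        # Seeded, deterministic 64-bit hash of every shingle; keep the minimum.
        sig.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8],
                "big")
            for s in sh))
    return sig

def band_keys(sig):
    return [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]

class BucketIndex:
    """Maps LSH buckets to the labels of messages seen in them."""

    def __init__(self):
        self.labels = defaultdict(Counter)

    def add(self, text, label):
        for key in band_keys(minhash(text)):
            self.labels[key][label] += 1

    def guess(self, text):
        """Majority label over all buckets hit, or None if no bucket matches."""
        votes = Counter()
        for key in band_keys(minhash(text)):
            if key in self.labels:
                votes.update(self.labels[key])
        if not votes:
            return None
        return votes.most_common(1)[0][0]

idx = BucketIndex()
idx.add("cheap pills buy now limited offer click here today", "spam")
idx.add("minutes of the project meeting are attached for review", "ham")
```

A bucket hit would at best be one more hint next to the existing checks, since
a message that lands in no known bucket gets no answer at all.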




------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Assp-test mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-test




DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally
privileged and protected in law and are intended solely for the use of the
individual to whom it is addressed.
This email was scanned for viruses multiple times. There should be no
known virus in this email!
*******************************************************


