Thanks Thomas, for the info and explanation, that makes sense.
One question though: I'm trying to understand the difference between spammy and hammy entries in the database, so I ran the following query:

assp=# select * from hmmdb where pkey like '%testosterone%';
                                        pkey                                         |  pvalue   | pfrozen
-------------------------------------------------------------------------------------+-----------+---------
 testosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub                                       | 0.9999999 |       0
 @macmedics.com\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong                     | 0.9999999 |       0
 ssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand                                        | 0.9999999 |       0
 @macmedics.com\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand                      | 0.9999999 |       0
 @macmedics.com\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel                      | 0.9999999 |       0
 free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong                                       | 0.9999999 |       0
 @macmedics.com\x1C98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 |       0
 boost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel                                        | 0.9999999 |       0
 98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone                   | 0.9999999 |       0
 @macmedics.com\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub                     | 0.9999999 |       0
(10 rows)

These spammy entries look identical to the disclaimer entries, which you apparently were saying were corrected-notspam. Sorry, I apparently don't know enough Perl to figure out how the code deals with this. Are these entries simply in a different section of the database, or does each entry in fact contain enough info to identify whether it is a spam or ham word? When I just dump the database, spam and ham entries appear together, so it appears to be the latter.

Thanks again,
- Phil

Re: <https://sourceforge.net/p/assp/mailman/message/36699848/> [Assp-user] Disclaimers not being removed?
From: Thomas Eckardt - 2019-06-22 07:16:34
Attachments: Message <https://sourceforge.net/p/assp/mailman/attachment/tITC.207679798c.OF3B47793A.C5717FD8-ONC1258421.0023541F-C1258421.0027F2FD%40thockar.com/1/> as HTML

> I also noticed the regex had truncated words in some but not all cases so I fixed that

ASSP_WordStem.pm is installed and used, so word stemming is done and stop-words are removed. Any attempt to "fix" this is wrong! If the disclaimer is not stemmed in the mail, another language was detected for that mail. There is nothing you can (or should) fix.

The disclaimer definition and every mail are processed as follows:
- remove all special characters and spaces
- detect the language
- stem all words according to the detected language

Another way to make sure the disclaimer is ignored by ASSP is to compose one or more fake mails that contain only disclaimers (possibly multiple times). Put them in the opposite correction folder.

companyname\x1Cis\x1Can\x1Ciphone\x1Cpowered | 0.9999999 | 0

(for this entry, that would be corrected-notspam)

Make sure the MD5 hash of the body is different in all these mails. Remove the disclaimer definition.

The disclaimer content will then either get a weight between 0.4 and 0.6 and not be stored in the databases, or get a weight <= 0.4 and be detected as good.

Thomas
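For anyone following Thomas's advice about fake disclaimer-only mails, here is a minimal Python sketch of one way to generate them so that each body differs. It is only an illustration: the disclaimer text, the addresses, and the helper names `build_disclaimer_mails` and `body_md5` are made up for this example, and it assumes that a plain MD5 over the decoded text body is close enough to whatever ASSP hashes internally, which the thread does not confirm.

```python
import hashlib
from email.message import EmailMessage

# Hypothetical disclaimer text; substitute the real one.
DISCLAIMER = "This message is confidential and intended only for the addressee."


def build_disclaimer_mails(disclaimer: str, count: int) -> list[EmailMessage]:
    """Build `count` mails whose bodies hold only the disclaimer, repeated 1..count times."""
    mails = []
    for n in range(1, count + 1):
        msg = EmailMessage()
        msg["From"] = "sender@example.com"      # placeholder address
        msg["To"] = "recipient@example.com"     # placeholder address
        msg["Subject"] = f"disclaimer sample {n}"
        # Repeating the text n times gives every mail a different body,
        # and therefore a different body hash.
        msg.set_content("\n\n".join([disclaimer] * n))
        mails.append(msg)
    return mails


def body_md5(msg: EmailMessage) -> str:
    """MD5 of the decoded text body (an assumption about what needs to differ)."""
    return hashlib.md5(msg.get_content().encode("utf-8")).hexdigest()


mails = build_disclaimer_mails(DISCLAIMER, 3)
hashes = [body_md5(m) for m in mails]

# Sanity check before copying the files into the correction folder.
assert len(set(hashes)) == len(hashes), "two mails share the same body hash"
```

Each message can then be written out with `bytes(msg)` and copied into the appropriate correction folder by hand.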
_______________________________________________
Assp-user mailing list
Assp-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-user
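As a reading aid for the hmmdb rows quoted in this thread, the following Python sketch splits a `pkey` as psql displays it (with the separator shown as the literal text `\x1C`) and buckets a `pvalue` using the thresholds Thomas mentions. The helper names `split_key` and `classify` are invented for this example. Two things are assumptions, not statements about ASSP's code: that `pvalue` is the same "weight" Thomas refers to, and that values of 0.6 and above lean towards spam (the thread only states the "<= 0.4 is good" and "0.4 to 0.6 is not stored" cases).

```python
# Separator as it appears in text copied from psql output (literal backslash-x-1-C).
DISPLAYED_SEP = r"\x1C"


def split_key(pkey: str) -> list[str]:
    """Split a displayed hmmdb key into its individual tokens."""
    return pkey.split(DISPLAYED_SEP)


def classify(pvalue: float) -> str:
    """Bucket a weight using the thresholds quoted in the thread."""
    if pvalue <= 0.4:
        return "good"          # stated by Thomas: detected as good
    if pvalue < 0.6:
        return "neutral"       # stated by Thomas: not stored in the databases
    return "spam-leaning"      # assumption, by symmetry with the "good" case


tokens = split_key(r"free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong")
label = classify(0.9999999)
```

Under these assumptions, every row in the query output above would fall in the spam-leaning bucket, since all of them carry a `pvalue` of 0.9999999.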