Thomas,

Agreed, but the reason I analyzed the database in the first place was that the Bayes/HMM output was picking up the disclaimers as being spammy, likely because they are present in errors/spam reports. It’s definitely much better now, though, and the mail is detected as not spam.

I can certainly compose some fake mails to put into the corrected-notspam corpus if this again becomes a problem but just as an FYI, there is still some spammy Bayesian output related to the disclaimers, so the disclaimer removal process doesn’t seem to quite get them all:

Bayesian Analysis:
- word stemming engine is used
- language italian(text) detected

Good Words                        Good Prob
helo pv50p00im-ztdgrandword.me    0.0370
[addr] ssub                       0.0370
pv50p00im-ztdgrandword.me rcpt    0.0370
sender [addr]                     0.0370
ssub test                         0.1273

Bad Words                         Bad Prob
is an                             0.7529
company find                      0.7423
an iphon                          0.7312
www domain                        0.6286
us at                             0.6267
at href                           0.6164
powered company                   0.6164
companyname is                    0.6143
domain mob                        0.6000
iphon powered                     0.5942

HMM Analysis:

Good Sequences                                         Good Prob
rcpt [addr] sender [addr] ssub                         0.0000 *
[addr] sender [addr] ssub test                         0.0000 *
pv50p00im-ztdgrandword.me rcpt [addr] sender [addr]    0.0000 *
helo pv50p00im-ztdgrandword.me rcpt [addr] sender      0.0000 *

Bad Sequences                                          Bad Prob
powered company find us at                             0.6000
company find us at href                                0.6000

Bayesian Spam Probability: doubtful NOT SPAM
  spam-probability: 3.2086685e-09
  ham-probability: 1.6400847e-05
  combined probability: 0.00019560 - got 15 - used 15 most significant results
  answer/query relation: 71% of 21
  bayesian confidence: 0.00000744
  corpus confidence: 0.88889008
  Values marked with an *, are irrelevant for the confidence calculation.

Hidden-Markov-Model Spam Probability: confident NOT SPAM
  spam-probability: 3.6e-29
  ham-probability: 0.15999994
  combined probability: 0.00000000 - got 6 - used 6 most significant results
  HMM confidence: 0.01975311
  answer/query relation: 33% of 18
  corpus confidence: 0.88889008
  Values marked with an *, are irrelevant for the confidence calculation.

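As an aside on the numbers in this report: they are consistent with a naive-Bayes style combination, where multiplying the listed probabilities gives roughly the spam-probability, multiplying their complements gives roughly the ham-probability, and the combined probability is spam / (spam + ham). The sketch below only checks that arithmetic against the figures quoted in this message. It is not ASSP's code, the function names are hypothetical, and because the table values are rounded to four places the products match the report only approximately.

```python
from math import prod

# Probabilities copied from the Bayesian word table in this message.
# The report rounds them to 4 places, so the products are approximate.
BAYES_PROBS = [
    0.0370, 0.0370, 0.0370, 0.0370, 0.1273,   # entries below 0.5
    0.7529, 0.7423, 0.7312, 0.6286, 0.6267,   # entries above 0.5
    0.6164, 0.6164, 0.6143, 0.6000, 0.5942,
]


def spam_ham_products(probs):
    """Return (product of p, product of 1 - p) over all probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam, ham


def combined(spam, ham):
    """Normalize the two products into a single spam probability."""
    return spam / (spam + ham)


spam, ham = spam_ham_products(BAYES_PROBS)
print(f"spam ~ {spam:.3e}  ham ~ {ham:.3e}  combined ~ {combined(spam, ham):.8f}")

# Feeding in the exact spam/ham values from the report reproduces
# the printed combined probability.
print(f"{combined(3.2086685e-09, 1.6400847e-05):.8f}")  # prints 0.00019560
```

The same normalization applied to the HMM figures (3.6e-29 and 0.15999994) gives a value that rounds to the reported 0.00000000.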
HMM and Bayesian Log:
Jun-27-19 12:49:10 [Main_Thread] HMM Check [scoring] - Prob: 0.00000 - Confidence: 0.01975 => confident.ham - answer/query relation: 33% of 18

Let me take a moment to say though, that this is without a doubt, THE GREATEST SPAM FILTER EVER!!! The large number of multiple checks and the configurability allow you to tweak the spammers into submission. I’ve been making use of this project for years, first on Windows and then on Linux server platforms (we become wiser as we age), with results that are simply amazing.

THANK YOU! And you too Fritz, may you Rest in Peace.

Phil Quesinberry
Q Systems Engineering, Inc.
Embedded Systems, Telecom, IT
(410) 969-8002 Ext.102
http://www.qsystemsengineering.com

From: Thomas Eckardt
Sent: Sunday, June 23, 2019 3:54 AM
To: For Users of ASSP
Subject: Re: [Assp-user] Disclaimers not being removed?

These are spam entries (> 0.6). To correct them, put the content into the correctednotspam folder.

IMHO, analyzing the database makes no sense. Instead, analyze emails and check that there is no Bayes and HMM output related to the disclaimer.

Thomas

From: "Phil Quesinberry" <pques...@qsystemsengineering.com>
To: "'For Users of ASSP'" <assp-user@lists.sourceforge.net>
Date: 23.06.2019 04:13
Subject: Re: [Assp-user] Disclaimers not being removed?

Thanks Thomas, for the info and explanation, that makes sense.

One question though, I’m trying to understand the difference between spammy and hammy entries in the database, so I did the following query:

assp=# select * from hmmdb where pkey like '%testosterone%';
pkey | pvalue | pfrozen
-------------------------------------------------------------------------------------+-----------+---------
testosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub | 0.9999999 | 0
@domain.com\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong | 0.9999999 | 0
ssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand | 0.9999999 | 0
@domain.com\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand | 0.9999999 | 0
@domain.com\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel | 0.9999999 | 0
free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong | 0.9999999 | 0
@domain.com\x1C98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 | 0
boost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel | 0.9999999 | 0
98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 | 0
@domain.com\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub | 0.9999999 | 0
(10 rows)

These spammy entries look identical to the disclaimers which you apparently were saying were corrected-notspam. Sorry, I apparently don’t know enough Perl to figure out how the code is dealing with this, are these entries simply in a different section of the database or does each entry in fact contain enough info to identify whether it is a spam or ham word? When I just dump the database, spam and ham entries appear to be together so it appears to be the latter.

Thanks again,
- Phil

Re: [Assp-user] Disclaimers not being removed?
<https://sourceforge.net/p/assp/mailman/message/36699848/>
From: Thomas Eckardt - 2019-06-22 07:16:34

>I also noticed the regex had truncated words in some but not all cases so I fixed that

ASSP_WordStem.pm is installed and used -> word stemming is done and stop-words are removed. Any attempt to "fix" this is wrong! If the disclaimer is not stemmed in the mail, another language was detected for the mail. There is nothing you can (and should) fix.

The disclaimer-definition and every mail are processed as follows:
- remove all special characters and spaces
- detect the language
- stem all words according to the detected language

Another way to make sure the disclaimer is ignored by ASSP is to compose one or more fake mails that contain only disclaimers (possibly multiple times). Put them in the opposite correction folder.

companyname\x1Cis\x1Can\x1Ciphone\x1Cpowered | 0.9999999 | 0

(here this would be corrected-notspam)

Make sure the MD5 hash of the body is different in all these mails. Remove the disclaimer-definition. The disclaimer content will then either get a weight between 0.4 and 0.6 and not be stored in the databases, or get a weight <= 0.4 and be detected as good.

Thomas

Virus-free. www.avast.com

_______________________________________________
Assp-user mailing list
Assp-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-user

DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally privileged and protected in law and are intended solely for the use of the individual to whom it is addressed.
This email was multiple times scanned for viruses.
There should be no known virus in this email!
*******************************************************

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
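
On the question of how spam and ham entries are told apart: Thomas's replies suggest that the stored probability itself carries that information, so the entries do not need to live in separate sections. The following is a minimal sketch of that reading, not ASSP's actual Perl code. It assumes the `\x1C` shown by psql stands for the ASCII file-separator byte between tokens, and it treats a value of exactly 0.6 as neutral, which the thread does not specify. The function names are hypothetical.

```python
SEP = "\x1c"  # ASCII file separator; assumed to be what psql renders as \x1C


def split_key(pkey_as_displayed):
    """Split a pkey copied from psql output (literal '\\x1C' text) into tokens."""
    return pkey_as_displayed.replace("\\x1C", SEP).split(SEP)


def band(pvalue, low=0.4, high=0.6):
    """Map a stored probability to the bands named in the thread.

    "> 0.6" is called a spam entry and "<= 0.4" is detected as good.
    Handling of exactly 0.6 is an assumption made for this sketch.
    """
    if pvalue > high:
        return "spam"
    if pvalue <= low:
        return "ham"
    return "neutral"  # per the thread, this range is not stored at all


# One row as it appears in the thread.
row = r"companyname\x1Cis\x1Can\x1Ciphone\x1Cpowered | 0.9999999 | 0"
pkey, pvalue, pfrozen = (field.strip() for field in row.split("|"))
print(split_key(pkey), band(float(pvalue)))
```

Under these assumptions the example row splits into five tokens and lands in the spam band, which is why moving such content into corrected-notspam is the suggested fix.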
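
The advice to compose fake mails that contain only disclaimers, each with a different MD5 body hash, could be prepared along these lines. This is a rough sketch under stated assumptions: the disclaimer text is a placeholder, the helper names are hypothetical, and how ASSP normalizes a body before hashing is not described in the thread, so the digest here is simply taken over the UTF-8 text.

```python
import hashlib

# Placeholder only; substitute the real disclaimer wording.
DISCLAIMER = "Placeholder disclaimer text goes here."


def build_bodies(disclaimer, count):
    """Return `count` bodies holding only the disclaimer, repeated 1..count times."""
    return ["\n".join([disclaimer] * n) for n in range(1, count + 1)]


def md5_hex(body):
    """MD5 hex digest of the body text (UTF-8 encoded)."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


bodies = build_bodies(DISCLAIMER, 3)
digests = [md5_hex(body) for body in bodies]

# Every fake mail should have its own body hash.
assert len(set(digests)) == len(digests), "bodies must not share an MD5 hash"

for n, digest in enumerate(digests, start=1):
    print(f"{n} repetition(s): md5={digest}")
```

Varying the repetition count is one simple way to keep the bodies distinct while still containing nothing but the disclaimer; the resulting files would then go into the corrected-notspam folder as described in the thread.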