I'll get more messages into errors\spam right away and play with the maxbytes settings as suggested.
MaxCorrectedDays was 0 (so never delete, right?). It's always been that way, intentionally. I manually edit as needed, but it trims to my 15k max.

Somehow, MaxBayesFileAge was only 31. I am almost certain I've always had this at 0, so that files are deleted randomly during the rebuild process to get a better sampling. Is that a bad strategy now? *Is there any chance that one of the new versions erroneously overwrites this MaxBayesFileAge 0 value??* I certainly could be mistaken, or maybe I somehow reset that to default. That at least explains why my notspam folder was sub 10k.

Yes, most of our spam messages are very short. Is that unusual? It's always been true here at least.

Thanks for the help!!

On Tue, Dec 18, 2018 at 8:38 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> You need many more spam mails.
>
> The target should be a corpusnorm of 0.9 ... 1.1 at the log line "info:
> corpusnorm after processing messages/errors-spam and
> messages/errors-notspam".
>
> This will require about 5,000 spam mails in messages/errors-spam
> ([1] moving *(e.g. older) well known spam* from messages/spam to
> messages/errors-spam will help).
>
> It looks like most of your collected spam mails are very short. 16,000
> spam and 2,200 ham result in a corpusnorm of 0.77 -> collect at least
> 4,000 + [1] more spam mails.
>
> Set MaxCorrectedDays very high (e.g. 10,000) and leave it that way forever.
>
> Procedure:
>
> Increase MaxFiles (e.g. to 30,000).
> Set the first value of MaxBayesFileAge much higher until the corpusnorm is
> balanced (this will take some days) - then calculate the age of the oldest
> spam and set the first value of MaxBayesFileAge accordingly - count the
> files in messages/spam and set MaxFiles accordingly.
> If the corpusnorm is fine, leave the settings alone for some days (be
> patient !!!!).
>
> Then increase MaxBytes to 8,000. This will lead to a corpusnorm that is
> too low. Start the above procedure again.
> Then increase MaxBytes to 20,000.
> This will again lead to a corpusnorm that is too low. Start the above
> procedure again.
>
> Check the rebuild log every few days. Small corrections to
> MaxBayesFileAge will help to keep everything fine. Most times no correction
> will be required.
> If "info: corpusnorm after processing messages/errors-spam and
> messages/errors-notspam..." becomes too unbalanced, correct the long-term
> corpus manually (move files)!
>
> Keep in mind: the rebuild task requires two runs after any of the above
> value changes to reach the auto-self-healthy-state!
>
> Thomas
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 17.12.2018 16:05
> Subject: [Assp-test] Rebuild only needs 1 file from notspam?
> ------------------------------
>
> I just reviewed a rebuild log and was shocked to see:
> Dec-17-18 02:25:25 info: require approximately 1 files (2 words) from folder messages/notspam to get the wanted corpusnorm (1.000)
>
> That's after the messages/spam folder (15k messages) is processed.
> I have maxfiles set to 15,000
> maxbytes set to 4,000
>
> Suggestions? I certainly want our users' good mail to be considered!
> Can't say I've ever seen this before, but I don't review the rebuild log
> terribly often.
>
> Copy of rebuild log:
>
> File rebuildrun.txt follows:
>
> Dec-17-18 02:15:00 RebuildSpamDB-thread rebuildspamdb-version 7.50 started in ASSP version 2.6.2(18339)
> Dec-17-18 02:15:00 RebuildSpamDB uses BerkeleyDB for temporary hashes
> Dec-17-18 02:15:00 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte
> Dec-17-18 02:15:00 RebuildSpamDB will create a Hidden Markov Model
> Dec-17-18 02:15:00 RebuildSpamDB will include attachment-database-entries in to spamdb
> Dec-17-18 02:15:00 RebuildSpamDB will create unicode enabled databases
> Dec-17-18 02:15:00 RebuildSpamDB will process all words as Sequence of UAX #29 Grapheme Clusters
> Dec-17-18 02:15:00 RebuildSpamDB will normalize unicode characters
> Dec-17-18 02:15:00 RebuildSpamDB will use the ASSP_WordStem engine
>
> Dec-17-18 02:15:00 ---ASSP Settings---
> Dec-17-18 02:15:00 Do Not Collect Messages with RedListed address: Enabled **Messages with RedListed addresses will be removed from the corpus!**
> Dec-17-18 02:15:00 Do Not Collect RedRe Messages: Enabled **Messages matching the RedRe will be removed from the corpus!**
> Dec-17-18 02:15:00 Use Subject as Maillog Names: True
> Dec-17-18 02:15:00 Maxbytes: 4,000
> Dec-17-18 02:15:00 Maxfiles: 15,000
> Dec-17-18 02:15:00 RebuildFileTimeLimit: 1 5
> Dec-17-18 02:15:00 RebuildFileTimeLimit: files will be moved away from the corpus if their processing takes longer than 5 second(s)
>
> Dec-17-18 02:15:00 Trashlist cleaning finished, 2 of 56 files deleted
>
> Dec-17-18 02:15:00 c:/ASSP/messages/errors-spam
> Dec-17-18 02:15:00 File Count: 934
> Dec-17-18 02:15:00 Processing... messages/errors-spam with 934 files
> Dec-17-18 02:15:52 0 attachment/image entries processed
> Dec-17-18 02:15:52 Imported Files for HeloBlackList: 933
> Dec-17-18 02:15:52 Imported Files for Bayes/HMM: 933
> Dec-17-18 02:15:52 Finished in 52 seconds (17.94 files/s - 9.88 MByte)
>
> Dec-17-18 02:15:52 c:/ASSP/messages/errors-notspam
> Dec-17-18 02:15:52 File Count: 2,209
> Dec-17-18 02:15:52 Processing... messages/errors-notspam with 2,209 files
> Dec-17-18 02:18:36 0 attachment/image entries processed
> Dec-17-18 02:18:36 Imported Files for HeloBlackList: 2,208
> Dec-17-18 02:18:36 Imported Files for Bayes/HMM: 2,208
> Dec-17-18 02:18:36 Finished in 164 seconds (13.46 files/s - 34.86 MByte)
> Dec-17-18 02:18:36 info: corpusnorm after processing messages/errors-spam and messages/errors-notspam is Spam Weight: 657272 / Not-Spam Weight: 3563832 => norm: 0.184
> Dec-17-18 02:18:36 info: require approximately all files (2,061,306 words) from folder messages/spam to get the wanted corpusnorm (1.000)
>
> Dec-17-18 02:18:36 c:/ASSP/messages/spam
> Dec-17-18 02:18:36 File Count: 14,937
> Dec-17-18 02:18:36 Processing... messages/spam with 14,937 files
> Dec-17-18 02:25:25 0 attachment/image entries processed
> Dec-17-18 02:25:25 Imported Files for HeloBlackList: 14,937
> Dec-17-18 02:25:25 Imported Files for Bayes/HMM: 14,937
> Dec-17-18 02:25:25 Finished in 409 seconds (36.52 files/s - 69.05 MByte)
> Dec-17-18 02:25:25 info: require approximately 1 files (2 words) from folder messages/notspam to get the wanted corpusnorm (1.000)
>
> Dec-17-18 02:25:25 c:/ASSP/messages/notspam
> Dec-17-18 02:25:25 File Count: 9,382
> Dec-17-18 02:25:25 Processing... messages/notspam with 9,382 files
> Dec-17-18 02:26:42 0 attachment/image entries processed
> Dec-17-18 02:26:42 Imported Files for HeloBlackList: 9,382
> Dec-17-18 02:26:42 Imported Files for Bayes/HMM: 0
> Dec-17-18 02:26:42 Finished in 77 seconds (121.84 files/s - 81.79 MByte)
>
> Dec-17-18 02:26:42 Generating weighted Bayesian tuplets
> Dec-17-18 02:27:04 start populating Spamdb with 465,296 records - Bayesian check is now disabled!
> Dec-17-18 02:28:19 Finished populating Spamdb with 465,296 records - Bayesian check is now enabled!
> Dec-17-18 02:28:19 done - Generating weighted Bayesian tuplets
>
> Dec-17-18 02:28:19 Bayesian Pairs: 465,296 now in list
>
> Dec-17-18 02:28:19 Generating consolidated Hidden-Markov-Model database from 2,155,159 record model
> Dec-17-18 02:30:25 HMM sequences: 1,059,525 now in list
>
> Dec-17-18 02:30:26 generating Spamdb.helo records from 13,393 collected HELO's
> Dec-17-18 02:30:28 cleaning old Spamdb.helo records
> Dec-17-18 02:30:28 done - cleaning old Spamdb.helo records
>
> Dec-17-18 02:30:28 HELO Blacklist: 25 new, 1,159 now in list
>
> Dec-17-18 02:30:28 Spam Weight : 2,745,357
> Dec-17-18 02:30:28 Not-Spam Weight: 3,563,832
>
> Dec-17-18 02:30:28 Corpus norm: 0.7703 - (ok - slighly ham heavy)
> Dec-17-18 02:30:28 Corpus confidence: 0.66134618
>
> Dec-17-18 02:30:33 Start populating Hidden Markov Model. HMM-check is disabled for this time!
> Dec-17-18 02:30:33 start populating Hidden Markov Model with 1,059,525 records!
> Dec-17-18 02:33:08 Finished populating Hidden Markov Model with 1,059,525 records!
> Dec-17-18 02:33:08 Finished populating Hidden Markov Model. HMM-check is now enabled again!
>
> Dec-17-18 02:33:08 Total processing time: 1,088 second(s)
> Dec-17-18 02:33:08 Total processing data: 195.58 MByte
> Dec-17-18 02:33:08 Rebuild processed 39.12 files per second.
>
> Dec-17-18 02:33:08 After finishing the Rebuild process, the c:/ASSP/tmpDB folder contains 363.74 MByte.
> Dec-17-18 02:33:08 After finishing the Rebuild process, the drive that contains the c:/ASSP/tmpDB folder has 12.89 GByte free space from total 25.20 GByte.
>
> Dec-17-18 02:33:08 building new GripList records and bounce report
> Dec-17-18 02:33:08 processing Logfile c:/ASSP/logs/maillog.txt
> Dec-17-18 02:33:08 processing Logfile c:/ASSP/logs/18-12-16.maillog.txt
> Dec-17-18 02:33:15 processing Logfile c:/ASSP/logs/18-12-15.maillog.txt
> Dec-17-18 02:33:20 processing Logfile c:/ASSP/logs/18-12-14.maillog.txt
> Dec-17-18 02:33:28 processing Logfile c:/ASSP/logs/18-12-13.maillog.txt
> Dec-17-18 02:33:29 processing Logfile c:/ASSP/logs/18-12-12.maillog.txt
>
> Dec-17-18 02:33:30 bounce report for the last two days: 11 bounces received (possibly delayed) - 1 bounces blocked
>
> Dec-17-18 02:33:30 list of the top ten local addresses with blocked bounces in the last two days:
>
> b...@ourcharity.org : 1
>
> Dec-17-18 02:33:30 end of bounce report
>
> Dec-17-18 02:33:31 Uploading Griplist via Direct Connection
> Dec-17-18 02:33:32 Submitted 6,144 bytes: 0 IPv6 addresses, 2,654 IPv4 addresses, good IP's 811 , bad IP's 1,137
>
> Dec-17-18 02:33:32 Trashlist was saved to c:/ASSP/trashlist.db
>
> THANKS!!
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
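
The weights in the quoted rebuild log are consistent with the corpus norm being the ratio of spam weight to not-spam weight: 657272 / 3563832 gives about 0.184 and 2,745,357 / 3,563,832 gives about 0.7703, matching the two norm values the log reports. Below is a minimal Python sketch of that arithmetic. It assumes the ratio is how the norm is derived (the log suggests this, but it is not taken from the ASSP source), and the names `parse_weights`, `corpus_norm` and `in_target_band` are hypothetical helpers, not part of ASSP. The 0.9 to 1.1 band is the target Thomas gives above.

```python
import re

# Final weights copied from the quoted rebuild log.
LOG_EXCERPT = """\
Dec-17-18 02:30:28 Spam Weight : 2,745,357
Dec-17-18 02:30:28 Not-Spam Weight: 3,563,832
"""

# The lookbehind keeps "Spam Weight" from matching inside "Not-Spam Weight".
SPAM_RE = re.compile(r"(?<!Not-)Spam Weight\s*:\s*([\d,]+)")
HAM_RE = re.compile(r"Not-Spam Weight\s*:\s*([\d,]+)")


def parse_weights(log_text):
    """Return (spam_weight, not_spam_weight) found in the log text."""
    spam = SPAM_RE.search(log_text)
    ham = HAM_RE.search(log_text)
    if spam is None or ham is None:
        raise ValueError("weights not found in log text")
    return (int(spam.group(1).replace(",", "")),
            int(ham.group(1).replace(",", "")))


def corpus_norm(spam_weight, not_spam_weight):
    """Assumed definition: spam weight divided by not-spam weight."""
    return spam_weight / not_spam_weight


def in_target_band(norm, low=0.9, high=1.1):
    """True if the norm lies in the 0.9 ... 1.1 band suggested in the thread."""
    return low <= norm <= high


spam_w, ham_w = parse_weights(LOG_EXCERPT)
norm = corpus_norm(spam_w, ham_w)
print(f"norm = {norm:.4f}, balanced = {in_target_band(norm)}")
```

Under this assumption the norm is already below 1 once messages/spam has been processed, so adding ham weight would only move it further from 1. That would be consistent with the rebuild asking for just 1 file from messages/notspam.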
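
One step of the procedure quoted above is to count the files in messages/spam and find the age of the oldest one, so that MaxFiles and the first value of MaxBayesFileAge can be set accordingly. A rough Python sketch of that bookkeeping follows. It assumes a file's modification time is a reasonable proxy for the message's age, which may not be the timestamp ASSP itself uses, and `corpus_stats` is a hypothetical helper name. The folder path is the one shown in the rebuild log.

```python
import math
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def corpus_stats(folder, now=None):
    """Return (file_count, oldest_age_days) for the files directly in folder.

    The age is based on the modification time and rounded up to whole days.
    An empty folder yields (0, 0).
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    if not mtimes:
        return 0, 0
    oldest_age_days = math.ceil((now - min(mtimes)) / SECONDS_PER_DAY)
    return len(mtimes), oldest_age_days


if __name__ == "__main__":
    folder = Path("c:/ASSP/messages/spam")  # path as shown in the rebuild log
    if folder.is_dir():
        count, age = corpus_stats(folder)
        print(f"{folder}: {count} files, oldest is about {age} day(s) old")
    else:
        print(f"{folder} not found; point this at your own corpus folder")
```

The two numbers are only a starting point for the settings. The thread's advice is still to wait for two rebuild runs after each change and then check the corpusnorm in the rebuild log.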