You need many more spam mails.

The target should be a corpusnorm of 0.9 to 1.1 in the rebuild log line 
"info: corpusnorm after processing messages/errors-spam and 
messages/errors-notspam".

This will require roughly 5,000 spam mails in messages/errors-spam 
   ([1] moving well-known spam, e.g. older messages, from messages/spam to 
messages/errors-spam will help).
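
Step [1] can be scripted. The following is only a sketch, not part of ASSP: 
the function name move_oldest is made up, and it assumes that the file 
modification time is a good enough measure of "older". It defaults to a 
dry run so the selection can be checked first, and it does not handle name 
collisions in the target folder.

```python
import shutil
from pathlib import Path


def move_oldest(src, dst, count, dry_run=True):
    """Pick the `count` oldest files (by mtime) in src and move them to dst.

    Returns the selected file names, oldest first. With dry_run=True
    nothing is moved; the names are only reported.
    """
    src, dst = Path(src), Path(dst)
    files = sorted(
        (p for p in src.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    picked = files[:count]
    if not dry_run:
        dst.mkdir(parents=True, exist_ok=True)
        for p in picked:
            shutil.move(str(p), str(dst / p.name))
    return [p.name for p in picked]
```

For example, move_oldest("c:/ASSP/messages/spam", 
"c:/ASSP/messages/errors-spam", 1000) lists the candidates; add 
dry_run=False to actually move them. Only well-known spam should be moved, 
so review the list before moving anything.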

It looks like most of your collected spam mails are very short: about 
16,000 spam and 2,200 ham result in a corpusnorm of 0.77 -> collect at 
least 4,000 more spam mails, in addition to [1].
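
The numbers in your log are consistent with the norm being the spam weight 
divided by the not-spam weight (657272 / 3563832 = 0.184 and 2,745,357 / 
3,563,832 = 0.7703). A small sketch of that arithmetic, with made-up 
function names, shows how much spam weight is missing for a norm of 0.9. 
The weights are whatever the rebuild log reports; how many mails that 
corresponds to depends on how long the mails are.

```python
def corpus_norm(spam_weight, ham_weight):
    """Norm as it appears in the rebuild log: spam weight / not-spam weight."""
    return spam_weight / ham_weight


def spam_weight_needed(spam_weight, ham_weight, target=0.9):
    """Additional spam weight needed to reach `target`, ham left unchanged."""
    return max(0.0, target * ham_weight - spam_weight)


# Values taken from the rebuild log quoted below.
print(round(corpus_norm(657_272, 3_563_832), 3))     # 0.184
print(round(corpus_norm(2_745_357, 3_563_832), 4))   # 0.7703
print(round(spam_weight_needed(2_745_357, 3_563_832)))  # 462092
```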

Set MaxCorrectedDays very high (e.g. 10,000) and leave it that way 
permanently.

Procedure:

Increase MaxFiles (e.g. to 30,000).
Set the first value of MaxBayesFileAge much higher until the corpusnorm is 
balanced (this will take some days). Then calculate the age of the oldest 
spam and set the first value of MaxBayesFileAge accordingly. Count the 
files in messages/spam and set MaxFiles accordingly.
Once the corpusnorm is fine, leave the settings alone for some days (be 
patient!). 
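
Counting the files and finding the age of the oldest spam can be done with 
a few lines of Python. This is a sketch only: folder_stats is a made-up 
helper, and it assumes that the file modification time matches the age the 
rebuild task looks at, which you should verify on your system.

```python
import time
from pathlib import Path


def folder_stats(folder, now=None):
    """Return (file_count, age_of_oldest_file_in_days) for a corpus folder."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    if not mtimes:
        return 0, 0.0
    return len(mtimes), (now - min(mtimes)) / 86400.0
```

Calling folder_stats("c:/ASSP/messages/spam") returns the two numbers to 
compare against MaxFiles and the first value of MaxBayesFileAge.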

Then increase MaxBytes to 8,000. This will push the corpusnorm too low. 
Start the above procedure again.
Then increase MaxBytes to 20,000. This will again push the corpusnorm too 
low. Start the above procedure again.

Check the rebuild log every few days. Small corrections to 
MaxBayesFileAge will help to keep everything fine. Most of the time no 
correction will be required.
If "info: corpusnorm after processing messages/errors-spam and 
messages/errors-notspam..." becomes too unbalanced, correct the long-term 
corpus manually (move files)!
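
The regular check can be automated by pulling that info line out of the 
rebuild log. The sketch below is not an ASSP feature: the function names 
are made up, and the pattern is written against the log format quoted 
below, so it may need adjusting for other ASSP versions. The 0.9 to 1.1 
range is the target given above.

```python
import re

# Tolerates line wraps and thousands separators in the weights.
NORM_RE = re.compile(
    r"info:\s+corpusnorm\s+after\s+processing\s+messages/errors-spam\s+and\s+"
    r"messages/errors-notspam\s+is\s+Spam\s+Weight:\s*([\d,]+)\s*/\s*"
    r"Not-Spam\s+Weight:\s*([\d,]+)\s*=>\s*norm:\s*([\d.]+)"
)


def errors_corpus_norm(log_text):
    """Return (spam_weight, ham_weight, norm) from the log, or None."""
    m = NORM_RE.search(log_text)
    if m is None:
        return None
    spam = int(m.group(1).replace(",", ""))
    ham = int(m.group(2).replace(",", ""))
    return spam, ham, float(m.group(3))


def is_balanced(norm, low=0.9, high=1.1):
    """True if the norm is inside the target range."""
    return low <= norm <= high
```

Feed it the contents of rebuildrun.txt; with the log quoted below it 
yields a norm of 0.184, which is far outside the target range.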

Keep in mind: after any of the above value changes, the rebuild task 
needs two runs to reach its self-healing state!

Thomas



From:    "K Post" <nntp.p...@gmail.com>
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date:    17.12.2018 16:05
Subject: [Assp-test] Rebuild only needs 1 file from notspam?



I just reviewed a rebuild log and was shocked to see:
Dec-17-18 02:25:25 info: require approximately 1 files (2 words) from 
folder messages/notspam to get the wanted corpusnorm (1.000)

That's after the messages/spam folder (15k messages) is processed.  
I have maxfiles set to 15,000
maxbytes set to 4,000

Suggestions?  I certainly want our users' good mail to be considered!  
Can't say I've seen this ever before, but I don't review the rebuild log 
terribly often.

Copy of rebuild log:


File rebuildrun.txt follows:


Dec-17-18 02:15:00 RebuildSpamDB-thread rebuildspamdb-version 7.50 started 
in ASSP version 2.6.2(18339)

Dec-17-18 02:15:00 RebuildSpamDB uses BerkeleyDB for temporary hashes

Dec-17-18 02:15:00 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte

Dec-17-18 02:15:00 RebuildSpamDB will create a Hidden Markov Model

Dec-17-18 02:15:00 RebuildSpamDB will include attachment-database-entries 
in to spamdb

Dec-17-18 02:15:00 RebuildSpamDB will create unicode enabled databases

Dec-17-18 02:15:00 RebuildSpamDB will process all words as Sequence of UAX 
#29 Grapheme Clusters

Dec-17-18 02:15:00 RebuildSpamDB will normalize unicode characters

Dec-17-18 02:15:00 RebuildSpamDB will use the ASSP_WordStem engine

Dec-17-18 02:15:00 ---ASSP Settings---
Dec-17-18 02:15:00 Do Not Collect Messages with RedListed address: Enabled 
**Messages with RedListed addresses will be removed from the corpus!**

Dec-17-18 02:15:00 Do Not Collect RedRe Messages: Enabled **Messages 
matching the RedRe will be removed from the corpus!**

Dec-17-18 02:15:00 Use Subject as Maillog Names: True
Dec-17-18 02:15:00 Maxbytes: 4,000
Dec-17-18 02:15:00 Maxfiles: 15,000
Dec-17-18 02:15:00 RebuildFileTimeLimit: 1 5
Dec-17-18 02:15:00 RebuildFileTimeLimit: files will be moved away from the 
corpus if their processing takes longer than 5 second(s) 

Dec-17-18 02:15:00 Trashlist cleaning finished, 2 of 56 files deleted

Dec-17-18 02:15:00 c:/ASSP/messages/errors-spam
Dec-17-18 02:15:00 File Count: 934
Dec-17-18 02:15:00 Processing... messages/errors-spam with 934 files
Dec-17-18 02:15:52 0 attachment/image entries processed
Dec-17-18 02:15:52 Imported Files for HeloBlackList: 933
Dec-17-18 02:15:52 Imported Files for Bayes/HMM: 933
Dec-17-18 02:15:52 Finished in 52 seconds (17.94 files/s - 9.88 MByte)

Dec-17-18 02:15:52 c:/ASSP/messages/errors-notspam
Dec-17-18 02:15:52 File Count: 2,209
Dec-17-18 02:15:52 Processing... messages/errors-notspam with 2,209 files
Dec-17-18 02:18:36 0 attachment/image entries processed
Dec-17-18 02:18:36 Imported Files for HeloBlackList: 2,208
Dec-17-18 02:18:36 Imported Files for Bayes/HMM: 2,208
Dec-17-18 02:18:36 Finished in 164 seconds (13.46 files/s - 34.86 MByte)
Dec-17-18 02:18:36 info: corpusnorm after processing messages/errors-spam 
and messages/errors-notspam is Spam Weight: 657272 / Not-Spam Weight: 
3563832 => norm: 0.184
Dec-17-18 02:18:36 info: require approximately all files (2,061,306 words) 
from folder messages/spam to get the wanted corpusnorm (1.000)

Dec-17-18 02:18:36 c:/ASSP/messages/spam
Dec-17-18 02:18:36 File Count: 14,937
Dec-17-18 02:18:36 Processing... messages/spam with 14,937 files
Dec-17-18 02:25:25 0 attachment/image entries processed
Dec-17-18 02:25:25 Imported Files for HeloBlackList: 14,937
Dec-17-18 02:25:25 Imported Files for Bayes/HMM: 14,937
Dec-17-18 02:25:25 Finished in 409 seconds (36.52 files/s - 69.05 MByte)
Dec-17-18 02:25:25 info: require approximately 1 files (2 words) from 
folder messages/notspam to get the wanted corpusnorm (1.000)

Dec-17-18 02:25:25 c:/ASSP/messages/notspam
Dec-17-18 02:25:25 File Count: 9,382
Dec-17-18 02:25:25 Processing... messages/notspam with 9,382 files
Dec-17-18 02:26:42 0 attachment/image entries processed
Dec-17-18 02:26:42 Imported Files for HeloBlackList: 9,382
Dec-17-18 02:26:42 Imported Files for Bayes/HMM: 0
Dec-17-18 02:26:42 Finished in 77 seconds (121.84 files/s - 81.79 MByte)

Dec-17-18 02:26:42 Generating weighted Bayesian tuplets
Dec-17-18 02:27:04 start populating Spamdb with 465,296 records - Bayesian 
check is now disabled!
Dec-17-18 02:28:19 Finished populating Spamdb with 465,296 records - 
Bayesian check is now enabled!
Dec-17-18 02:28:19 done - Generating weighted Bayesian tuplets

Dec-17-18 02:28:19 Bayesian Pairs: 465,296 now in list

Dec-17-18 02:28:19 Generating consolidated Hidden-Markov-Model database 
from 2,155,159 record model
Dec-17-18 02:30:25 HMM sequences: 1,059,525 now in list

Dec-17-18 02:30:26 generating Spamdb.helo records from 13,393 collected 
HELO's
Dec-17-18 02:30:28 cleaning old Spamdb.helo records
Dec-17-18 02:30:28 done - cleaning old Spamdb.helo records

Dec-17-18 02:30:28 HELO Blacklist: 25 new, 1,159 now in list

Dec-17-18 02:30:28 Spam Weight    :   2,745,357
Dec-17-18 02:30:28 Not-Spam Weight:   3,563,832

Dec-17-18 02:30:28 Corpus norm: 0.7703 - (ok - slighly ham heavy)
Dec-17-18 02:30:28 Corpus confidence: 0.66134618

Dec-17-18 02:30:33 Start populating Hidden Markov Model. HMM-check is 
disabled for this time!
Dec-17-18 02:30:33 start populating Hidden Markov Model with 1,059,525 
records!
Dec-17-18 02:33:08 Finished populating Hidden Markov Model with 1,059,525 
records!
Dec-17-18 02:33:08 Finished populating Hidden Markov Model. HMM-check is 
now enabled again!

Dec-17-18 02:33:08 Total processing time: 1,088 second(s)

Dec-17-18 02:33:08 Total processing data: 195.58 MByte


Dec-17-18 02:33:08 Rebuild processed 39.12 files per second.

Dec-17-18 02:33:08 After finishing the Rebuild process, the c:/ASSP/tmpDB 
folder contains 363.74 MByte.

Dec-17-18 02:33:08 After finishing the Rebuild process, the drive that 
contains the c:/ASSP/tmpDB folder has 12.89 GByte free space from total 
25.20 GByte.

Dec-17-18 02:33:08 building new GripList records and bounce report
Dec-17-18 02:33:08 processing Logfile c:/ASSP/logs/maillog.txt
Dec-17-18 02:33:08 processing Logfile c:/ASSP/logs/18-12-16.maillog.txt
Dec-17-18 02:33:15 processing Logfile c:/ASSP/logs/18-12-15.maillog.txt
Dec-17-18 02:33:20 processing Logfile c:/ASSP/logs/18-12-14.maillog.txt
Dec-17-18 02:33:28 processing Logfile c:/ASSP/logs/18-12-13.maillog.txt
Dec-17-18 02:33:29 processing Logfile c:/ASSP/logs/18-12-12.maillog.txt

Dec-17-18 02:33:30 bounce report for the last two days: 11 bounces 
received (possibly delayed) - 1 bounces blocked

Dec-17-18 02:33:30 list of the top ten local addresses with blocked 
bounces in the last two days:

 b...@ourcharity.org : 1

Dec-17-18 02:33:30 end of bounce report

Dec-17-18 02:33:31 Uploading Griplist via Direct Connection
Dec-17-18 02:33:32 Submitted 6,144 bytes: 0 IPv6 addresses, 2,654 IPv4 
addresses, good IP's 811 , bad IP's 1,137

Dec-17-18 02:33:32 Trashlist was saved to c:/ASSP/trashlist.db


THANKS!!
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test



