I was seeing very fast parsing of notspam, I believe because there were a lot of errors-notspam files but far fewer errors-spam files.
As a test, I moved all errors-spam to spam and errors-notspam to notspam, leaving 0 error reports. Here's the result of the rebuild:

Apr-29-15 21:43:55 RebuildSpamDB-thread rebuildspamdb-version 7.10 started in ASSP version 2.4.4(15117)
Apr-29-15 21:43:55 rebuild debug output is enabled to c:/assp/rebuilddebug.txt
Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB for temporary hashes
Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte
Apr-29-15 21:43:55 RebuildSpamDB will create a Hidden Markov Model!
Apr-29-15 21:43:55 RebuildSpamDB will create unicode enabled databases.
Apr-29-15 21:43:55 RebuildSpamDB will process all words as Sequence of UAX #29 Grapheme Clusters.
Apr-29-15 21:43:55 RebuildSpamDB will normalize unicode characters.
Apr-29-15 21:43:55 RebuildSpamDB will use the ASSP_WordStem engine.
Apr-29-15 21:43:55 ---ASSP Settings---
Apr-29-15 21:43:55 Do Not Collect Messages with RedListed address: Enabled **Messages with RedListed addresses will be removed from the corpus!**
Apr-29-15 21:43:55 Do Not Collect RedRe Messages: Enabled **Messages matching the RedRe will be removed from the corpus!**
Apr-29-15 21:43:55 Use Subject as Maillog Names: True
Apr-29-15 21:43:55 Maxbytes: 2,500
Apr-29-15 21:43:55 RebuildFileTimeLimit: 1 5
Apr-29-15 21:43:55 RebuildFileTimeLimit: files will be moved away from the corpus, if their processing takes longer than 5 second(s)
Apr-29-15 21:44:02 Trashlist cleaning finished, 0 of 23606 files deleted
Apr-29-15 21:44:02 c:/assp/messages/errors-spam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-spam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 c:/assp/messages/errors-notspam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-notspam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 info: corpusnorm after processing messages/errors-spam and messages/errors-notspam is Spam Weight: 0 / Not-Spam Weight: 0 => norm: 1.000
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from folder messages/spam to get the wanted corpusnorm (1.000)
Apr-29-15 21:44:02 c:/assp/messages/spam
Apr-29-15 21:44:02 File Count: 18,219
Apr-29-15 21:44:02 Processing... messages/spam with 15,000 files
Apr-29-15 21:47:49 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:47:49 Imported Files for Bayes/HMM: 1,888
Apr-29-15 21:47:49 Finished in 227 second(s)
Apr-29-15 21:47:49 info: require apx. all files (360,036 words) from folder messages/notspam to get the wanted corpusnorm (1.000)
Apr-29-15 21:47:49 c:/assp/messages/notspam
Apr-29-15 21:47:49 File Count: 21,197
Apr-29-15 21:47:49 Processing... messages/notspam with 15,000 files
Apr-29-15 21:52:06 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:52:06 Imported Files for Bayes/HMM: 1,040
Apr-29-15 21:52:06 Finished in 257 second(s)
Apr-29-15 21:52:06 Generating weighted Bayesian tuplets
Apr-29-15 21:52:10 start populating Spamdb with 27,082 records - Bayesian check is now disabled!
Apr-29-15 21:52:24 Finished populating Spamdb with 27,082 records - Bayesian check is now enabled!
Apr-29-15 21:52:24 done - Generating weighted Bayesian tuplets
Apr-29-15 21:52:24 Bayesian Pairs: 27,082 now in list
Apr-29-15 21:52:24 Generating consolidated Hidden-Markov-Model database from 527,319 record model
Apr-29-15 21:52:46 HMM sequences: 259,284 now in list
Apr-29-15 21:52:46 generating Spamdb.helo records from 5,112 collected HELO's
Apr-29-15 21:52:47 cleaning old Spamdb.helo records
Apr-29-15 21:52:52 done - cleaning old Spamdb.helo records
Apr-29-15 21:52:52 HELO Blacklist: 12 new, 427 now in list
Apr-29-15 21:52:52 Spam Weight: 360,036
Apr-29-15 21:52:52 Not-Spam Weight: 360,070
Apr-29-15 21:52:52 Corpus norm: 0.9999 - (very good - balanced)
Apr-29-15 21:52:52 Corpus confidence: 1.00000000
Apr-29-15 21:52:57 Start populating Hidden Markov Model. HMM-check is disabled for this time!
Apr-29-15 21:53:01 start populating Hidden Markov Model with 259,284 records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model with 259,284 records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model. HMM-check is now enabled again!
Apr-29-15 21:53:06 Total processing time: 551 second(s)
Apr-29-15 21:53:06 Total processing data: 95.49 MByte
Apr-29-15 21:53:06 Rebuild processed 61.73 files per second.
Apr-29-15 21:53:06 After finishing the Rebuild process, the c:/assp/tmpDB folder contains 101.74 MByte.
Apr-29-15 21:53:06 After finishing the Rebuild process, the drive that contains the c:/assp/tmpDB folder has 20.17 GByte free space from total 25.20 GByte.

Why, after processing errors-spam and errors-notspam, does it say:

Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from folder messages/spam to get the wanted corpusnorm (1.000)

How does it know what will be in spam and notspam? Shouldn't it parse everything and then decide?

Based on the fast scan time of about 4 minutes each for spam and notspam, I'm guessing it's not looking at all files. Is that normal?
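
To put numbers on that guess, here is a rough sketch that only redoes the arithmetic on figures copied from the log above. It assumes the reported weights are word counts and that the corpus norm is simply spam weight divided by not-spam weight. That assumption reproduces the logged 0.9999, but it is not taken from the ASSP source, and the function names are made up for illustration.

```python
# Figures copied from the rebuild log above.
FILES_PROCESSED_PER_FOLDER = 15_000   # "Processing... messages/spam with 15,000 files"
SPAM_IMPORTED = 1_888                 # "Imported Files for Bayes/HMM: 1,888"
NOTSPAM_IMPORTED = 1_040              # "Imported Files for Bayes/HMM: 1,040"
SPAM_WEIGHT = 360_036                 # "Spam Weight: 360,036"
NOTSPAM_WEIGHT = 360_070              # "Not-Spam Weight: 360,070"


def corpus_norm(spam_weight, notspam_weight):
    """ASSUMPTION: norm = spam weight / not-spam weight (hypothetical helper)."""
    if spam_weight == 0 and notspam_weight == 0:
        return 1.0  # the log reports norm 1.000 when both weights are 0
    if notspam_weight == 0:
        return float("inf")
    return spam_weight / notspam_weight


def imported_share(imported, processed):
    """Fraction of processed files that ended up in Bayes/HMM (hypothetical helper)."""
    return imported / processed


if __name__ == "__main__":
    print(f"corpus norm: {corpus_norm(SPAM_WEIGHT, NOTSPAM_WEIGHT):.4f}")
    for name, imported, weight in (
        ("spam", SPAM_IMPORTED, SPAM_WEIGHT),
        ("notspam", NOTSPAM_IMPORTED, NOTSPAM_WEIGHT),
    ):
        share = imported_share(imported, FILES_PROCESSED_PER_FOLDER)
        print(f"{name}: {share:.1%} of processed files imported, "
              f"about {weight / imported:.0f} words per imported file")
```

If that reading is right, only about 13% of the spam files and 7% of the notspam files went into Bayes/HMM, while all 15,000 files per folder were still used for the HeloBlackList.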
That seems like a really small spamdb and HMM given 30k files (even with only the first 2.5 KB of each file being looked at).