First: removing or moving files from the 'errors' corpus like you did is one of the most stupid things I ever saw! You gave ASSP an apoplexy!
>How does it know what will be in spam and notspam.

ASSP knows what was done in the rebuild task in the past and how it was done.
If you make major changes manually in the corpus, remove the file 'assp/normfile' and run the rebuild twice: the first run to get a clue about what has been changed, the second to get the best detection rate.

Thomas

From: K Post <nntp.p...@gmail.com>
To: ASSP development mailing list <assp-test@lists.sourceforge.net>
Date: 30.04.2015 04:10
Subject: [Assp-test] Rebuild not parsing everything

I was seeing a very fast parsing of notspam, I believe because there were a lot of errors-notspam, but far fewer errors-spam. As a test, I moved all errors-spam to spam and errors-notspam to notspam, leaving 0 error reports. Here's the result of the rebuild:

Apr-29-15 21:43:55 RebuildSpamDB-thread rebuildspamdb-version 7.10 started in ASSP version 2.4.4(15117)
Apr-29-15 21:43:55 rebuild debug output is enabled to c:/assp/rebuilddebug.txt
Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB for temporary hashes
Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte
Apr-29-15 21:43:55 RebuildSpamDB will create a Hidden Markov Model!
Apr-29-15 21:43:55 RebuildSpamDB will create unicode enabled databases.
Apr-29-15 21:43:55 RebuildSpamDB will process all words as Sequence of UAX #29 Grapheme Clusters.
Apr-29-15 21:43:55 RebuildSpamDB will normalize unicode characters.
Apr-29-15 21:43:55 RebuildSpamDB will use the ASSP_WordStem engine.
Apr-29-15 21:43:55 ---ASSP Settings---
Apr-29-15 21:43:55 Do Not Collect Messages with RedListed address: Enabled **Messages with RedListed addresses will be removed from the corpus!**
Apr-29-15 21:43:55 Do Not Collect RedRe Messages: Enabled **Messages matching the RedRe will be removed from the corpus!**
Apr-29-15 21:43:55 Use Subject as Maillog Names: True
Apr-29-15 21:43:55 Maxbytes: 2,500
Apr-29-15 21:43:55 RebuildFileTimeLimit: 1 5
Apr-29-15 21:43:55 RebuildFileTimeLimit: files will be moved away from the corpus, if their processing takes longer than 5 second(s)
Apr-29-15 21:44:02 Trashlist cleaning finished, 0 of 23606 files deleted
Apr-29-15 21:44:02 c:/assp/messages/errors-spam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-spam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 c:/assp/messages/errors-notspam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-notspam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 info: corpusnorm after processing messages/errors-spam and messages/errors-notspam is Spam Weight: 0 / Not-Spam Weight: 0 => norm: 1.000
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from folder messages/spam to get the wanted corpusnorm (1.000)
Apr-29-15 21:44:02 c:/assp/messages/spam
Apr-29-15 21:44:02 File Count: 18,219
Apr-29-15 21:44:02 Processing... messages/spam with 15,000 files
Apr-29-15 21:47:49 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:47:49 Imported Files for Bayes/HMM: 1,888
Apr-29-15 21:47:49 Finished in 227 second(s)
Apr-29-15 21:47:49 info: require apx. all files (360,036 words) from folder messages/notspam to get the wanted corpusnorm (1.000)
Apr-29-15 21:47:49 c:/assp/messages/notspam
Apr-29-15 21:47:49 File Count: 21,197
Apr-29-15 21:47:49 Processing... messages/notspam with 15,000 files
Apr-29-15 21:52:06 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:52:06 Imported Files for Bayes/HMM: 1,040
Apr-29-15 21:52:06 Finished in 257 second(s)
Apr-29-15 21:52:06 Generating weighted Bayesian tuplets
Apr-29-15 21:52:10 start populating Spamdb with 27,082 records - Bayesian check is now disabled!
Apr-29-15 21:52:24 Finished populating Spamdb with 27,082 records - Bayesian check is now enabled!
Apr-29-15 21:52:24 done - Generating weighted Bayesian tuplets
Apr-29-15 21:52:24 Bayesian Pairs: 27,082 now in list
Apr-29-15 21:52:24 Generating consolidated Hidden-Markov-Model database from 527,319 record model
Apr-29-15 21:52:46 HMM sequences: 259,284 now in list
Apr-29-15 21:52:46 generating Spamdb.helo records from 5,112 collected HELO's
Apr-29-15 21:52:47 cleaning old Spamdb.helo records
Apr-29-15 21:52:52 done - cleaning old Spamdb.helo records
Apr-29-15 21:52:52 HELO Blacklist: 12 new, 427 now in list
Apr-29-15 21:52:52 Spam Weight: 360,036
Apr-29-15 21:52:52 Not-Spam Weight: 360,070
Apr-29-15 21:52:52 Corpus norm: 0.9999 - (very good - balanced)
Apr-29-15 21:52:52 Corpus confidence: 1.00000000
Apr-29-15 21:52:57 Start populating Hidden Markov Model. HMM-check is disabled for this time!
Apr-29-15 21:53:01 start populating Hidden Markov Model with 259,284 records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model with 259,284 records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model. HMM-check is now enabled again!
Apr-29-15 21:53:06 Total processing time: 551 second(s)
Apr-29-15 21:53:06 Total processing data: 95.49 MByte
Apr-29-15 21:53:06 Rebuild processed 61.73 files per second.
Apr-29-15 21:53:06 After finishing the Rebuild process, the c:/assp/tmpDB folder contains 101.74 MByte.
Apr-29-15 21:53:06 After finishing the Rebuild process, the drive that contains the c:/assp/tmpDB folder has 20.17 GByte free space from total 25.20 GByte.

Why, after processing errors-spam and errors-notspam, does it say:

Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from folder messages/spam to get the wanted corpusnorm (1.000)

How does it know what will be in spam and notspam? Shouldn't it parse all and then decide?

Based on the fast 4-minute scan time for each of spam and notspam, I'm guessing it's not looking at all files. Is that normal? The spamdb and HMM seem really small given 30k files (even with only the first 2.5 KB being looked at).

_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test

DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally privileged and protected in law and are intended solely for the use of the individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no known virus in this email!
*******************************************************
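
For anyone reading this thread later, the figures in the rebuild log can be cross-checked with a little arithmetic. The following is only a sketch under assumptions: the log does not state how ASSP computes the corpus norm, so `corpus_norm` below assumes it is the plain ratio of spam weight to not-spam weight (which happens to reproduce the logged 0.9999), and the "words per file" figure is simply the logged word target divided by the logged file estimate. All function and variable names are made up for illustration and are not part of ASSP.

```python
# Numbers copied from the rebuild log in this thread.
# The formulas are assumptions for illustration, not ASSP's actual code.
SPAM_WEIGHT = 360_036
NOTSPAM_WEIGHT = 360_070


def corpus_norm(spam: int, notspam: int) -> float:
    """Assumed definition: spam weight divided by not-spam weight."""
    if notspam == 0:
        # The log reports norm 1.000 when both weights are 0.
        return 1.0 if spam == 0 else float("inf")
    return spam / notspam


norm = corpus_norm(SPAM_WEIGHT, NOTSPAM_WEIGHT)

# "require apx. 2,812 files (360,000 words)" implies an average
# words-per-file figure, presumably remembered from earlier rebuilds.
words_per_file = 360_000 / 2_812

# Share of the 15,000 processed files per folder that were actually
# imported for Bayes/HMM according to the log.
spam_share = 1_888 / 15_000
notspam_share = 1_040 / 15_000

print(f"corpus norm     : {norm:.4f}")
print(f"words per file  : {words_per_file:.1f}")
print(f"spam imported   : {spam_share:.1%}")
print(f"notspam imported: {notspam_share:.1%}")
```

Read this way, the short Bayes/HMM import counts look consistent with the rebuild stopping once the word target for the wanted corpus norm is reached, rather than with files being skipped by accident; that reading is an interpretation of the log, not something the log states.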