I was seeing very fast parsing of notspam, I believe because there were a
lot of errors-notspam files but far fewer errors-spam files.

As a test, I moved all errors-spam to spam and errors-notspam to notspam,
leaving 0 error reports.

Here's the result of the rebuild:


Apr-29-15 21:43:55 RebuildSpamDB-thread rebuildspamdb-version 7.10 started
in ASSP version 2.4.4(15117)

Apr-29-15 21:43:55 rebuild debug output is enabled to
c:/assp/rebuilddebug.txt

Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB for temporary hashes

Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte

Apr-29-15 21:43:55 RebuildSpamDB will create a Hidden Markov Model!

Apr-29-15 21:43:55 RebuildSpamDB will create unicode enabled databases.

Apr-29-15 21:43:55 RebuildSpamDB will process all words as Sequence of UAX
#29 Grapheme Clusters.

Apr-29-15 21:43:55 RebuildSpamDB will normalize unicode characters.

Apr-29-15 21:43:55 RebuildSpamDB will use the ASSP_WordStem engine.

Apr-29-15 21:43:55 ---ASSP Settings---
Apr-29-15 21:43:55 Do Not Collect Messages with RedListed address: Enabled
**Messages with RedListed addresses will be removed from the corpus!**

Apr-29-15 21:43:55 Do Not Collect RedRe Messages: Enabled **Messages
matching the RedRe will be removed from the corpus!**

Apr-29-15 21:43:55 Use Subject as Maillog Names: True
Apr-29-15 21:43:55 Maxbytes: 2,500
Apr-29-15 21:43:55 RebuildFileTimeLimit: 1 5
Apr-29-15 21:43:55 RebuildFileTimeLimit: files will be moved away from the
corpus, if their processing takes longer than 5 second(s)

Apr-29-15 21:44:02 Trashlist cleaning finished, 0 of 23606 files deleted

Apr-29-15 21:44:02 c:/assp/messages/errors-spam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-spam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)

Apr-29-15 21:44:02 c:/assp/messages/errors-notspam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-notspam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 info: corpusnorm after processing messages/errors-spam
and messages/errors-notspam is Spam Weight: 0 / Not-Spam Weight: 0 => norm:
1.000
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from
folder messages/spam to get the wanted corpusnorm (1.000)

Apr-29-15 21:44:02 c:/assp/messages/spam
Apr-29-15 21:44:02 File Count: 18,219
Apr-29-15 21:44:02 Processing... messages/spam with 15,000 files
Apr-29-15 21:47:49 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:47:49 Imported Files for Bayes/HMM: 1,888
Apr-29-15 21:47:49 Finished in 227 second(s)
Apr-29-15 21:47:49 info: require apx. all files (360,036 words) from folder
messages/notspam to get the wanted corpusnorm (1.000)

Apr-29-15 21:47:49 c:/assp/messages/notspam
Apr-29-15 21:47:49 File Count: 21,197
Apr-29-15 21:47:49 Processing... messages/notspam with 15,000 files
Apr-29-15 21:52:06 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:52:06 Imported Files for Bayes/HMM: 1,040
Apr-29-15 21:52:06 Finished in 257 second(s)

Apr-29-15 21:52:06 Generating weighted Bayesian tuplets
Apr-29-15 21:52:10 start populating Spamdb with 27,082 records - Bayesian
check is now disabled!
Apr-29-15 21:52:24 Finished populating Spamdb with 27,082 records -
Bayesian check is now enabled!
Apr-29-15 21:52:24 done - Generating weighted Bayesian tuplets

Apr-29-15 21:52:24 Bayesian Pairs: 27,082 now in list

Apr-29-15 21:52:24 Generating consolidated Hidden-Markov-Model database
from 527,319 record model
Apr-29-15 21:52:46 HMM sequences: 259,284 now in list

Apr-29-15 21:52:46 generating Spamdb.helo records from 5,112 collected
HELO's
Apr-29-15 21:52:47 cleaning old Spamdb.helo records
Apr-29-15 21:52:52 done - cleaning old Spamdb.helo records

Apr-29-15 21:52:52 HELO Blacklist: 12 new, 427 now in list

Apr-29-15 21:52:52 Spam Weight:   360,036
Apr-29-15 21:52:52 Not-Spam Weight:   360,070

Apr-29-15 21:52:52 Corpus norm: 0.9999 - (very good - balanced)
Apr-29-15 21:52:52 Corpus confidence: 1.00000000

Apr-29-15 21:52:57 Start populating Hidden Markov Model. HMM-check is
disabled for this time!
Apr-29-15 21:53:01 start populating Hidden Markov Model with 259,284
records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model with 259,284
records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model. HMM-check is
now enabled again!

Apr-29-15 21:53:06 Total processing time: 551 second(s)

Apr-29-15 21:53:06 Total processing data: 95.49 MByte


Apr-29-15 21:53:06 Rebuild processed 61.73 files per second.

Apr-29-15 21:53:06 After finishing the Rebuild process, the c:/assp/tmpDB
folder contains 101.74 MByte.

Apr-29-15 21:53:06 After finishing the Rebuild process, the drive that
contains the c:/assp/tmpDB folder has 20.17 GByte free space from total
25.20 GByte.
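
As a sanity check on the numbers above, the reported corpus norm looks like it is simply the ratio of the two weights. A minimal Python sketch, assuming norm = Spam Weight / Not-Spam Weight (that formula is my reading of the log output, not something taken from the rebuildspamdb source, and the function name is made up for illustration):

```python
def corpus_norm(spam_weight, notspam_weight):
    """Hypothetical helper: ratio of spam weight to not-spam weight."""
    # Assumption: fall back to 1.0 when there is nothing to divide by,
    # since the log prints "norm: 1.000" for weights 0 / 0.
    if notspam_weight == 0:
        return 1.0
    return spam_weight / notspam_weight


# Weights copied from the end of the rebuild log.
norm = corpus_norm(360036, 360070)
print(f"{norm:.4f}")  # prints 0.9999
```

That matches the "Corpus norm: 0.9999" line, so at least the final balance figure is consistent with the two weights.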



Why, after processing errors-spam and errors-notspam, does it say:
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from
folder messages/spam to get the wanted corpusnorm (1.000)


How does it know what will be in spam and notspam? Shouldn't it parse all
files and then decide? Based on the fast scan time of roughly 4 minutes each
for spam and notspam, I'm guessing it's not looking at all files. Is that
normal? That seems like a really small spamdb and HMM given 30k files (even
with only the first 2.5 KB of each file being looked at).
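
For what it's worth, here is the arithmetic behind my guess as a small Python sketch. It uses only figures from the log above. Treating the "weight" values as word counts, and the idea that the rebuild stops importing files into Bayes/HMM once it reaches the word target, are my assumptions and not something I have confirmed in the code; the variable and function names are made up.

```python
# Figures copied from the rebuild log.
target_words = 360_000      # "require apx. 2,812 files (360,000 words)"
estimated_files = 2_812

folders = {
    "spam":    {"processed": 15_000, "bayes_files": 1_888, "weight": 360_036},
    "notspam": {"processed": 15_000, "bayes_files": 1_040, "weight": 360_070},
}


def words_per_file(words, files):
    """Average number of words contributed by each file."""
    return words / files


# Average the up-front estimate seems to assume.
est_avg = words_per_file(target_words, estimated_files)
print(f"estimate assumes about {est_avg:.0f} words per file")

# What the rebuild actually reported per folder.
for name, f in folders.items():
    share = 100 * f["bayes_files"] / f["processed"]
    avg = words_per_file(f["weight"], f["bayes_files"])
    print(f"{name}: {f['bayes_files']} of {f['processed']} files "
          f"({share:.1f}%) used for Bayes/HMM, about {avg:.0f} words per file")
```

If those assumptions hold, only a small fraction of the 15,000 processed files in each folder ended up in Bayes/HMM, which would explain both the short scan time and the small databases.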
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
