First:

removing or moving files from the 'errors' corpus the way you did is one of the 
most stupid things I have ever seen! You gave ASSP an apoplexy!

>How does it know what will be in spam and notspam.

ASSP knows what was done in past rebuild tasks and how it was done.
If you make major manual changes to the corpus, remove the file 
'assp/normfile' and run the rebuild twice: the first run to get a clue about 
what has changed, the second to get the best detection rate.
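
The first step of that procedure can be sketched in Python. This is only an illustration: the helper name remove_normfile is hypothetical, and the base directory (c:/assp, taken from the paths in the log below) is an assumption about the installation. The rebuild itself still has to be started twice from ASSP afterwards.

```python
from pathlib import Path


def remove_normfile(assp_base: str) -> bool:
    """Delete <assp_base>/normfile if it exists.

    Returns True if a file was removed, False if there was nothing to remove.
    The helper name and the base-directory argument are illustrative only.
    """
    normfile = Path(assp_base) / "normfile"
    if normfile.is_file():
        normfile.unlink()
        return True
    return False
```

For the installation in this thread the call would be remove_normfile("c:/assp"), assuming that is the ASSP base directory.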

Thomas





From:    K Post <nntp.p...@gmail.com>
To:      ASSP development mailing list <assp-test@lists.sourceforge.net>
Date:    30.04.2015 04:10
Subject: [Assp-test] Rebuild not parsing everything



I was seeing a very fast parsing of notspam, I believe because there were a
lot of errors-notspam files, but far fewer errors-spam files.

As a test, I moved all errors-spam to spam and errors-notspam to notspam,
leaving 0 error reports.

Here's the result of the rebuild:


Apr-29-15 21:43:55 RebuildSpamDB-thread rebuildspamdb-version 7.10 started
in ASSP version 2.4.4(15117)

Apr-29-15 21:43:55 rebuild debug output is enabled to
c:/assp/rebuilddebug.txt

Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB for temporary hashes

Apr-29-15 21:43:55 RebuildSpamDB uses BerkeleyDB-ENV with 62.50 MByte

Apr-29-15 21:43:55 RebuildSpamDB will create a Hidden Markov Model!

Apr-29-15 21:43:55 RebuildSpamDB will create unicode enabled databases.

Apr-29-15 21:43:55 RebuildSpamDB will process all words as Sequence of UAX
#29 Grapheme Clusters.

Apr-29-15 21:43:55 RebuildSpamDB will normalize unicode characters.

Apr-29-15 21:43:55 RebuildSpamDB will use the ASSP_WordStem engine.

Apr-29-15 21:43:55 ---ASSP Settings---
Apr-29-15 21:43:55 Do Not Collect Messages with RedListed address: Enabled
**Messages with RedListed addresses will be removed from the corpus!**

Apr-29-15 21:43:55 Do Not Collect RedRe Messages: Enabled **Messages
matching the RedRe will be removed from the corpus!**

Apr-29-15 21:43:55 Use Subject as Maillog Names: True
Apr-29-15 21:43:55 Maxbytes: 2,500
Apr-29-15 21:43:55 RebuildFileTimeLimit: 1 5
Apr-29-15 21:43:55 RebuildFileTimeLimit: files will be moved away from the
corpus, if their processing takes longer than 5 second(s)

Apr-29-15 21:44:02 Trashlist cleaning finished, 0 of 23606 files deleted

Apr-29-15 21:44:02 c:/assp/messages/errors-spam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-spam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)

Apr-29-15 21:44:02 c:/assp/messages/errors-notspam
Apr-29-15 21:44:02 File Count: 0
Apr-29-15 21:44:02 Processing... messages/errors-notspam with 0 files
Apr-29-15 21:44:02 Imported Files for HeloBlackList: 0
Apr-29-15 21:44:02 Imported Files for Bayes/HMM: 0
Apr-29-15 21:44:02 Finished in 1 second(s)
Apr-29-15 21:44:02 info: corpusnorm after processing messages/errors-spam
and messages/errors-notspam is Spam Weight: 0 / Not-Spam Weight: 0 => norm: 1.000
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from
folder messages/spam to get the wanted corpusnorm (1.000)

Apr-29-15 21:44:02 c:/assp/messages/spam
Apr-29-15 21:44:02 File Count: 18,219
Apr-29-15 21:44:02 Processing... messages/spam with 15,000 files
Apr-29-15 21:47:49 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:47:49 Imported Files for Bayes/HMM: 1,888
Apr-29-15 21:47:49 Finished in 227 second(s)
Apr-29-15 21:47:49 info: require apx. all files (360,036 words) from folder
messages/notspam to get the wanted corpusnorm (1.000)

Apr-29-15 21:47:49 c:/assp/messages/notspam
Apr-29-15 21:47:49 File Count: 21,197
Apr-29-15 21:47:49 Processing... messages/notspam with 15,000 files
Apr-29-15 21:52:06 Imported Files for HeloBlackList: 15,000
Apr-29-15 21:52:06 Imported Files for Bayes/HMM: 1,040
Apr-29-15 21:52:06 Finished in 257 second(s)

Apr-29-15 21:52:06 Generating weighted Bayesian tuplets
Apr-29-15 21:52:10 start populating Spamdb with 27,082 records - Bayesian
check is now disabled!
Apr-29-15 21:52:24 Finished populating Spamdb with 27,082 records -
Bayesian check is now enabled!
Apr-29-15 21:52:24 done - Generating weighted Bayesian tuplets

Apr-29-15 21:52:24 Bayesian Pairs: 27,082 now in list

Apr-29-15 21:52:24 Generating consolidated Hidden-Markov-Model database
from 527,319 record model
Apr-29-15 21:52:46 HMM sequences: 259,284 now in list

Apr-29-15 21:52:46 generating Spamdb.helo records from 5,112 collected
HELO's
Apr-29-15 21:52:47 cleaning old Spamdb.helo records
Apr-29-15 21:52:52 done - cleaning old Spamdb.helo records

Apr-29-15 21:52:52 HELO Blacklist: 12 new, 427 now in list

Apr-29-15 21:52:52 Spam Weight:   360,036
Apr-29-15 21:52:52 Not-Spam Weight:   360,070

Apr-29-15 21:52:52 Corpus norm: 0.9999 - (very good - balanced)
Apr-29-15 21:52:52 Corpus confidence: 1.00000000

Apr-29-15 21:52:57 Start populating Hidden Markov Model. HMM-check is
disabled for this time!
Apr-29-15 21:53:01 start populating Hidden Markov Model with 259,284
records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model with 259,284
records!
Apr-29-15 21:53:06 Finished populating Hidden Markov Model. HMM-check is
now enabled again!

Apr-29-15 21:53:06 Total processing time: 551 second(s)

Apr-29-15 21:53:06 Total processing data: 95.49 MByte


Apr-29-15 21:53:06 Rebuild processed 61.73 files per second.

Apr-29-15 21:53:06 After finishing the Rebuild process, the c:/assp/tmpDB
folder contains 101.74 MByte.

Apr-29-15 21:53:06 After finishing the Rebuild process, the drive that
contains the c:/assp/tmpDB folder has 20.17 GByte free space from total
25.20 GByte.



Why after processing errors-spam and errors-notspam does it say:
Apr-29-15 21:44:02 info: require apx. 2,812 files (360,000 words) from
folder messages/spam to get the wanted corpusnorm (1.000)


How does it know what will be in spam and notspam?  Shouldn't it parse all
files and then decide?  Based on the fast 4-minute scan time for each of spam
and notspam, I'm guessing it's not looking at all files.  Is that normal?
It seems like a really small spamdb and HMM given 30k files (even with only
the first 2.5 KB of each being looked at).
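
The numbers in the log above can be checked with a little arithmetic. The sketch below assumes the reported corpus norm is simply the spam weight divided by the not-spam weight; the thread does not show the actual formula in rebuildspamdb, but the logged values fit that reading. The function names are hypothetical.

```python
def corpus_norm(spam_weight: int, notspam_weight: int) -> float:
    """Assumed definition: ratio of spam weight to not-spam weight."""
    if spam_weight == 0 and notspam_weight == 0:
        # The log reports norm 1.000 when both weights are 0.
        return 1.0
    return spam_weight / notspam_weight


def words_per_file(weight: int, imported_files: int) -> float:
    """Average weight (words) contributed by each file imported for Bayes/HMM."""
    return weight / imported_files


if __name__ == "__main__":
    # Final weights from the log.
    print(f"norm: {corpus_norm(360_036, 360_070):.4f}")                   # norm: 0.9999
    # Files actually imported for Bayes/HMM from each folder.
    print(f"spam words/file: {words_per_file(360_036, 1_888):.1f}")       # 190.7
    print(f"notspam words/file: {words_per_file(360_070, 1_040):.1f}")    # 346.2
```

Under that assumption, both folders were read only until each contributed about 360,000 words. That would explain why only 1,888 spam files and 1,040 notspam files were imported for Bayes/HMM even though 15,000 files per folder were processed for the HELO blacklist.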
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test







