Aha, thanks very much for the explanations, Kevin and Fritz. I didn't 
know about assp loading everything into memory. I thought that since I 
had the disk space to spare, I would be more generous with the 
collecting. Memory, on the other hand, is definitely at a premium. So 
thanks for the heads up!
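
To check the memory estimate in Kevin's reply quoted below, here is a rough sketch of the arithmetic. It assumes the worst case he describes: both the spam and notspam folders are full and every file is at the size limit. The function names are made up for illustration and are not part of ASSP.

```python
# Worst-case corpus size, assuming every collected file reaches the
# byte limit and both folders (spam and notspam) are full.
# corpus_bytes and to_mib are hypothetical helpers, not ASSP code.

def corpus_bytes(max_files: int, max_bytes: int, folders: int = 2) -> int:
    """Upper bound on corpus size in bytes for the given limits."""
    return max_files * folders * max_bytes


def to_mib(n_bytes: int) -> int:
    """Convert bytes to whole mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes // (1024 * 1024)


# Defaults quoted in the thread: 18009 files per folder, 4096 bytes each.
default_size = corpus_bytes(18009, 4096)
# My settings: 10000 files per folder, 50000 bytes each.
my_size = corpus_bytes(10000, 50000)

print("defaults:", default_size, "bytes, about", to_mib(default_size), "MB")
print("mine:    ", my_size, "bytes, about", to_mib(my_size), "MB")
```

This is only an upper bound. Real messages are often shorter than the limit, and the spam and notspam report folders are not counted here.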

Kevin wrote:
> David wrote:
>   
>> Should I clear out the collections and start over? It would be much 
>> easier to train assp with old known-good email if it would accept more 
>> than one attachment. What is it that is preventing assp from processing 
>> multiple attachments?
>>     
>
> The reason ASSP cannot accept multiple attachments is the way it
> handles incoming email. Currently it processes the message as a data
> stream, reads only as much of the stream as it needs to save the 
> spam report, and ignores the rest. When you send multiple attached 
> messages you are really only sending one message with a large data 
> stream attached.
>
> At least, this is how I understand it from looking over the code.
>
> Personally, I used this add-on (http://proqual.net/saveasmultiple/) 
> along with Thunderbird to save large numbers of messages at a time to 
> .eml files, then moved them manually into the corpus and ran 
> move2num.pl to fix the file names afterwards.
>
> At one point I had it exporting a few hundred messages at a time from a 
> spam/notspam collection I had saved from my previous anti-spam software.
>
>   
>> Additionally, if I have this 3:1 ratio, should I get assp to only 
>> collect every 3rd spam? 
>>     
>
> Collecting frequency applies to both spam and notspam, so it would simply 
> slow down your corpus growth.
>
> Collecting frequency is only recommended for high-volume servers where 
> collecting each message would cause the corpus to be overwritten too 
> fast. It also improves performance because ASSP does not have to write 
> each message to disk.
>
>  > Once I hit maxfiles, though, it shouldn't
>  > matter, as old mail will just be rotated out, right?
>
> You are FAR from hitting maxFiles for your notspam.
> The maxFiles setting is per folder, so it would take a while for you to 
> reach that limit.
>
>   
>> Why such an odd number as 18009 for maxfiles? And why should only 4k of 
>> the message be processed? I set it to collect ~50k so it would get all 
>> of every message. How is that worse?
>>     
>
> We use that number for max files because we found that it provided good 
> results. The default values have been tested by this community for quite 
> some time, and it is highly recommended not to change them.
>
> As for the '4096' default for maxBytes, 4096 happens to be the default 
> cluster size for the NTFS file system (for disks over 2GB). Using that as 
> the max message size helps speed up file system access and reduces 
> fragmentation on Windows; on *nix it's just a good default.
>
>  > And why should only 4k of the message be processed?
>  > I set it to collect ~50k so it would get all of every message.
>  > How is that worse?
>
> Because spam is MUCH wordier than ham. If you took a random sampling of 
> 200 spam messages and 200 ham messages with a max size of 50k per 
> message, the spam would be the larger of the two groups text-wise most of 
> the time.
>
> Another reason for the default message size and collect limit is that 
> when you run the rebuild script it loads the entire corpus into memory 
> for processing. So 18009 files x 2 (notspam and spam collections) x 4096 
> bytes would be 147529728 bytes, or about 140MB. This is roughly the max 
> size of the ASSP-controlled corpus with default values. Then there are 
> the spam and notspam reports: ASSP does NOT overwrite these at random, 
> and they are the only file collection that can grow out of control.
>
> With your settings of 'maxfiles' at 10000 and 'maxbytes' at 50000, you 
> would have a corpus of around 953MB (10000 x 2 x 50000 bytes). WAY 
> overkill, and the rebuild script would probably grind your system to a 
> halt.
>
>
> Kevin
>
>
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Assp-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/assp-user
>   

