What should I do with my current spam/notspam/errors collections? Should 
I throw them out and start fresh? If I was able to truncate them all to 
4k in size, would that be good enough?
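
The truncation being asked about could be sketched roughly as follows. This is only an illustration: `truncate_collection` is a hypothetical helper, not part of ASSP, and the folder path in the usage comment is an assumed layout. It edits files in place, so it is safer to try it on a copy of the collection first. It does not settle whether truncated files are "good enough" for the rebuild.

```python
import os
from pathlib import Path


def truncate_collection(folder, max_bytes=4096):
    """Cut every regular file directly inside `folder` down to max_bytes.

    Returns the number of files that were actually shortened.
    Hypothetical helper for illustration; not part of ASSP.
    """
    shortened = 0
    for path in Path(folder).iterdir():
        # Skip subfolders and files that are already small enough.
        if path.is_file() and path.stat().st_size > max_bytes:
            os.truncate(path, max_bytes)
            shortened += 1
    return shortened


# Example use (path is an assumption; work on a backup copy):
# truncate_collection("/path/to/copy/of/spam", max_bytes=4096)
```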

David wrote:
> Aha, thanks very much for the explanations Kevin and Fritz. I didn't 
> know about assp loading everything into memory. I thought that since I 
> had the disk space to spare, I would be more generous with the 
> collecting. Memory, on the other hand, is definitely at a premium. So 
> thanks for the heads up!
>
> Kevin wrote:
>   
>> David wrote:
>>   
>>     
>>> Should I clear out the collections and start over? It would be much 
>>> easier to train assp with old known-good email if it would accept more 
>>> than one attachment. What is it that is preventing assp from processing 
>>> multiple attachments?
>>>     
>>>       
>> The reason ASSP cannot accept multiple attachments is the way it 
>> handles incoming email. Currently it processes the message as a data 
>> stream, reads only as much of that stream as it needs to save the 
>> spam report, and ignores the rest. When you send multiple attached 
>> messages you are really only sending one message with a large data 
>> stream attached.
>>
>> At least this is how I understand it working from looking over the code.
>>
>> Personally I used this add-on (http://proqual.net/saveasmultiple/) 
>> along with Thunderbird to save large numbers of messages at a time to 
>> .eml files, then moved them manually into the corpus and ran 
>> move2num.pl to fix the file names afterwards.
>>
>> At one point I had it exporting a few hundred messages at a time from a 
>> spam/notspam collection I had saved from my previous anti-spam software.
>>
>>   
>>     
>>> Additionally, if I have this 3:1 ratio, should I get assp to only 
>>> collect every 3rd spam? 
>>>     
>>>       
>> Collecting frequency applies to both spam and notspam, so it would 
>> simply slow down your corpus growth.
>>
>> Collecting frequency is only recommended for high-volume servers where 
>> collecting each message would cause the corpus to be overwritten too 
>> fast. It also improves performance because ASSP does not have to write 
>> each message to disk.
>>
>>  > Once I hit maxfiles, though, it shouldn't
>>  > matter, as old mail will just be rotated out, right?
>>
>> You are FAR from hitting maxFiles for your notspam.
>> The maxFiles setting is per folder, so it would take a while for you to 
>> reach that limit.
>>
>>   
>>     
>>> Why such an odd number as 18009 for maxfiles? And why should only 4k of 
>>> the message be processed? I set it to collect ~50k so it would get all 
>>> of every message. How is that worse?
>>>     
>>>       
>> We use that number for maxFiles because we found that it provided good 
>> results. The default values have been tested by this community for quite 
>> some time, and it is highly recommended not to change them.
>>
>> As for the '4096' default for maxBytes, 4096 happens to be the default 
>> cluster size for the NTFS file system (for disks over 2GB). Using that as 
>> the max message size helps speed up file system access and reduces 
>> fragmentation on Windows. On *nix it's just a good default.
>>
>>  > And why should only 4k of the message be processed?
>>  > I set it to collect ~50k so it would get all of every message.
>>  > How is that worse?
>>
>> Because spam is MUCH wordier than ham. If you took a random sampling of 
>> 200 spam messages and 200 ham messages with a max size of 50k per 
>> message, the spam will be the larger of the two groups text wise most of 
>> the time.
>>
>> Another reason for the default message size and collection limit is that 
>> the rebuild script loads the entire corpus into memory for processing. 
>> So 18009 files x 2 (notspam and spam collections) x 4096 bytes would be 
>> 147529728 bytes, or about 140MB. This is roughly the max size of the 
>> ASSP-controlled corpus with default values. Then there are the spam and 
>> notspam reports: ASSP does NOT overwrite these at random, and they are 
>> the only file collection that can grow out of control.
>>
>> With your settings of 'maxfiles' at 10000 and 'maxbytes' at 50000, you 
>> would have a corpus of around 953MB (10000 x 2 x 50000 bytes). WAY 
>> overkill, and the rebuild script would probably grind your system to a halt.
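
The arithmetic in the two paragraphs above can be checked with a short sketch. It assumes the simple upper bound Kevin describes (files per folder x two folders x bytes kept per file) and ignores the spam and notspam report folders. The function names are made up for illustration and are not ASSP code.

```python
def corpus_bytes(max_files, max_bytes, folders=2):
    """Upper bound on corpus size in bytes.

    Assumes every folder is full (max_files each) and every file
    is stored at the full max_bytes limit.
    """
    return max_files * folders * max_bytes


def to_mb(n_bytes):
    """Convert bytes to megabytes, counting 1MB as 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


default_size = corpus_bytes(18009, 4096)   # ASSP defaults quoted above
custom_size = corpus_bytes(10000, 50000)   # David's settings

print(f"defaults: {default_size} bytes, {to_mb(default_size):.1f}MB")
print(f"custom:   {custom_size} bytes, {to_mb(custom_size):.1f}MB")
# defaults: 147529728 bytes, 140.7MB
# custom:   1000000000 bytes, 953.7MB
```

Since the rebuild script is said to load the whole corpus into memory, these figures are also a rough lower bound on the memory the rebuild would need.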
>>
>>
>> Kevin
>>
>>
>>
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by DB2 Express
>> Download DB2 Express C - the FREE version of DB2 express and take
>> control of your XML. No limits. Just data. Click to get it now.
>> http://sourceforge.net/powerbar/db2/
>> _______________________________________________
>> Assp-user mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/assp-user
>>   
>>     
>
