David wrote:
> Should I clear out the collections and start over? It would be much 
> easier to train assp with old known-good email if it would accept more 
> than one attachment. What is it that is preventing assp from processing 
> multiple attachments?

ASSP cannot accept multiple attachments because of the way it handles
incoming email. Currently it treats each message as a data stream and
reads only as much of that stream as it needs to save the spam report,
ignoring the rest. When you send multiple attached messages you are
really sending a single message with one large data stream attached.

At least, that is how I understand it from looking over the code.
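
To illustrate the point (this is not ASSP's code, just a sketch using
Python's standard email library, with made-up addresses and subjects):
a report carrying two attached messages is still one serialized
stream, and a reader that stops early never sees the second
attachment.

```python
from email.message import EmailMessage


def make_message(subject: str, body: str) -> EmailMessage:
    """Build a small plain-text message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "report@example.com"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


# One report message carrying two forwarded messages as attachments.
report = make_message("spam report", "two samples attached")
report.add_attachment(make_message("sample 1", "first spam body"))
report.add_attachment(make_message("sample 2", "second spam body"))

# On the wire this is a single byte stream, not three separate messages.
raw = report.as_bytes()
attachments = list(report.iter_attachments())

# A reader that stops before the second attachment only ever sees the first.
cut = raw.index(b"Subject: sample 2")
prefix = raw[:cut]

print(len(attachments), "attachments in one stream of", len(raw), "bytes")
print(b"Subject: sample 1" in prefix, b"Subject: sample 2" in prefix)
```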

Personally, I used this add-on (http://proqual.net/saveasmultiple/)
with Thunderbird to save large numbers of messages at a time as .eml
files, then moved them manually into the corpus and ran move2num.pl
afterwards to fix the file names.

At one point I had it exporting a few hundred messages at a time from a 
spam/notspam collection I had saved from my previous anti-spam software.

> Additionally, if I have this 3:1 ratio, should I get assp to only 
> collect every 3rd spam? 

Collecting frequency applies to both spam and notspam, so it would
simply slow down your corpus growth.

Collecting frequency is only recommended for high-volume servers where
collecting every message would cause the corpus to be overwritten too
quickly. It also improves performance, because ASSP does not have to
write each message to disk.
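
A rough sketch of why this does not change your ratio (the function
name and the sample stream are made up for illustration; this is not
how ASSP implements it): keeping every 3rd message of a 3:1 stream
still leaves a 3:1 mix, just with a third as many files.

```python
def should_collect(message_index: int, every_nth: int) -> bool:
    """Return True for every `every_nth`-th message (1-based index)."""
    return message_index % every_nth == 0


# Hypothetical incoming stream with a 3:1 spam-to-ham ratio.
kinds = ["spam", "spam", "spam", "ham"] * 30

# The same frequency is applied to every message, whatever its class.
kept = [k for i, k in enumerate(kinds, start=1) if should_collect(i, 3)]

print(len(kinds), "received,", len(kept), "collected")
print(kept.count("spam"), "spam and", kept.count("ham"), "ham collected")
```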

 > Once I hit maxfiles, though, it shouldn't
 > matter, as old mail will just be rotated out, right?

You are FAR from hitting maxFiles for your notspam.
The maxFiles setting is per folder, so it would take a while for you to 
reach that limit.

> Why such an odd number as 18009 for maxfiles? And why should only 4k of 
> the message be processed? I set it to collect ~50k so it would get all 
> of every message. How is that worse?

We use that number for maxFiles because we found that it gives good
results. The default values have been tested by this community for
quite some time, and it is highly recommended that you not change them.

As for the '4096' default for maxBytes: 4096 bytes happens to be the
default cluster size for the NTFS file system (for disks over 2GB).
Using that as the max message size helps speed up file system access
and reduces fragmentation on Windows; on *nix it's just a good default.

 > And why should only 4k of the message be processed?
 > I set it to collect ~50k so it would get all of every message.
 > How is that worse?

Because spam is MUCH wordier than ham. If you took a random sample of
200 spam messages and 200 ham messages with a max size of 50k per
message, the spam group would contain more text than the ham group most
of the time.

Another reason for the default message size and collect limit is that
the rebuild script loads the entire corpus into memory for processing.
18009 files x 2 (notspam and spam collections) x 4096 bytes comes to
147529728 bytes, or about 140MB, which is roughly the maximum size of
the ASSP-controlled corpus with default values. Then there are the spam
and notspam reports: ASSP does NOT overwrite these at random, and they
are the only file collections that can grow out of control.

With your settings of 'maxFiles' at 10000 and 'maxBytes' at 50000 you
would have a corpus of around 953MB (10000 x 2 x 50000 bytes). That is
WAY overkill, and the rebuild script would probably grind your system
to a halt.
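
If you want to check the arithmetic for other settings, here is a
quick sketch. The helper name is made up, and it assumes the worst
case where both folders are full and every file is at the size limit.

```python
MIB = 1024 * 1024


def corpus_bytes(max_files: int, max_bytes: int, folders: int = 2) -> int:
    """Upper bound on corpus size: every folder full, every file at the cap."""
    return max_files * max_bytes * folders


default_size = corpus_bytes(18009, 4096)   # default maxFiles and maxBytes
custom_size = corpus_bytes(10000, 50000)   # the settings from your mail

print(f"defaults: {default_size} bytes, {default_size / MIB:.1f} MB")
print(f"custom:   {custom_size} bytes, {custom_size / MIB:.1f} MB")
```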


Kevin




_______________________________________________
Assp-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-user
