Aha, thanks very much for the explanations, Kevin and Fritz. I didn't know that ASSP loads the whole corpus into memory during a rebuild. I thought that since I had the disk space to spare, I would be more generous with the collecting. Memory, on the other hand, is definitely at a premium. So thanks for the heads-up!
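To convince myself of the memory figures Kevin gives in the quoted message below, here is a rough Python sketch of the arithmetic. It assumes the worst case (both the spam and notspam folders are full at maxFiles, and every file is exactly maxBytes long) and counts 1 MB as 1024 x 1024 bytes. The function names corpus_bytes and to_mb are just mine for illustration; they are not part of ASSP.

```python
def corpus_bytes(max_files: int, max_bytes: int, folders: int = 2) -> int:
    """Worst-case corpus size: every folder full, every file at max_bytes.

    'folders' defaults to 2 (spam and notspam), as in Kevin's calculation.
    """
    return max_files * max_bytes * folders


def to_mb(n_bytes: int) -> float:
    """Convert bytes to megabytes, taking 1 MB = 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    # ASSP defaults quoted below, then my own (overly generous) settings.
    for label, files, size in [("defaults", 18009, 4096),
                               ("mine", 10000, 50000)]:
        total = corpus_bytes(files, size)
        print(f"{label}: {total} bytes, about {to_mb(total):.1f} MB")
```

This only bounds the ASSP-controlled collections; the spam and notspam report folders are not rotated, so they would come on top of these numbers.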
Kevin wrote:
> David wrote:
>
>> Should I clear out the collections and start over? It would be much
>> easier to train assp with old known-good email if it would accept more
>> than one attachment. What is it that is preventing assp from processing
>> multiple attachments?
>>
>
> The reason ASSP cannot accept multiple attachments is the way it
> handles incoming email. Currently it processes it like a data
> stream: it reads only as much of the stream as it needs to save the
> spam report and ignores the rest. When you send multiple attached
> messages you are really only sending one message with a large data
> stream attached.
>
> At least this is how I understand it working from looking over the code.
>
> Personally I used this add-on ( http://proqual.net/saveasmultiple/)
> along with Thunderbird to save large amounts of messages at a time to
> .eml files and then move them manually into the corpus, running
> move2num.pl to fix the file names after.
>
> At one point I had it exporting a few hundred messages at a time from a
> spam/notspam collection I had saved from my previous anti-spam software.
>
>
>> Additionally, if I have this 3:1 ratio, should I get assp to only
>> collect every 3rd spam?
>>
>
> Collecting frequency applies to both spam and notspam, so it would simply
> slow down your corpus growth.
>
> Collecting frequency is only recommended for high-volume servers where
> collecting each message would cause the corpus to be overwritten too
> fast. It also increases performance because it does not have to write
> each message to disk.
>
> > Once I hit maxfiles, though, it shouldn't
> > matter, as old mail will just be rotated out, right?
>
> You are FAR from hitting maxFiles for your notspam.
> The maxFiles setting is per folder, so it would take a while for you to
> reach that limit.
>
>
>> Why such an odd number as 18009 for maxfiles? And why should only 4k of
>> the message be processed?
>> I set it to collect ~50k so it would get all
>> of every message. How is that worse?
>>
>
> We use that number for max files because we found that it provided good
> results. The default values have been tested by this community for quite
> some time; it is highly recommended not to change them.
>
> As for the '4096' default for maxBytes, 4096 happens to be the default
> cluster size for the NTFS file system (for disks over 2GB). Using that as
> the max message size helps speed up file system access and reduces
> fragmentation on Windows; on *nix it's just a good default.
>
> > And why should only 4k of the message be processed?
> > I set it to collect ~50k so it would get all of every message.
> > How is that worse?
>
> Because spam is MUCH wordier than ham. If you took a random sampling of
> 200 spam messages and 200 ham messages with a max size of 50k per
> message, the spam will be the larger of the two groups text-wise most of
> the time.
>
> Another reason for the default message size and collect limit is that
> when you run the rebuild script it loads the entire corpus into memory
> for processing, so 18009 files x 2 (notspam and spam collections) x 4096
> bytes would be 147529728 bytes or 140MB. This is roughly the max size of
> the ASSP-controlled corpus with default values. Then there are the spam
> and notspam reports; ASSP does NOT overwrite these at random and they
> are the only file collection that can grow out of control.
>
> With your settings of 'maxfiles' at 10000 and 'maxbytes' at 50000 you
> would have a corpus of around 953MB. WAY overkill, and the rebuild
> script would probably grind your system to a halt.
>
>
> Kevin
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Assp-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/assp-user
