What should I do with my current spam/notspam/errors collections? Should I throw them out and start fresh? If I were able to truncate them all to 4k in size, would that be good enough?
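
For reference, the corpus-size figures quoted below can be reproduced with a few lines of Python. This is only a rough sketch of the arithmetic, not anything taken from ASSP itself: the function names are made up for illustration, "MB" is assumed to mean 2^20 bytes, and it assumes every file reaches the full maxBytes size, which is the worst case.

```python
# Rough upper bound on the size of the ASSP-managed corpus.
# Assumptions (not from ASSP's code): two managed folders (spam and
# notspam), every file at the full maxBytes size, 1 MB = 2**20 bytes.

def corpus_bytes(max_files: int, max_bytes: int, folders: int = 2) -> int:
    """Worst-case corpus size in bytes for the given settings."""
    return max_files * max_bytes * folders


def to_mb(n_bytes: int) -> float:
    """Convert bytes to MB, taking 1 MB as 2**20 bytes."""
    return n_bytes / (1024 * 1024)


default_size = corpus_bytes(18009, 4096)    # default maxFiles / maxBytes
custom_size = corpus_bytes(10000, 50000)    # the settings discussed below

print(f"defaults: {default_size} bytes, about {to_mb(default_size):.0f} MB")
print(f"custom:   {custom_size} bytes, about {to_mb(custom_size):.0f} MB")
```

With the defaults this gives 147529728 bytes, matching the figure in the quoted reply. The spam and notspam report folders are not included, since they are not capped the same way.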
David wrote:
> Aha, thanks very much for the explanations Kevin and Fritz. I didn't
> know about assp loading everything into memory. I thought that since I
> had the disk space to spare, I would be more generous with the
> collecting. Memory, on the other hand, is definitely at a premium. So
> thanks for the heads up!
>
> Kevin wrote:
>
>> David wrote:
>>
>>> Should I clear out the collections and start over? It would be much
>>> easier to train assp with old known-good email if it would accept more
>>> than one attachment. What is it that is preventing assp from processing
>>> multiple attachments?
>>>
>> The reason ASSP can not accept multiple attachments is because of the
>> way it handles incoming email. Currently it processes it like a data
>> stream and only reads as much of the data stream as it needs to save the
>> spam report and ignores the rest. When you send multiple attached
>> messages you are really only sending one message with a large data
>> stream attached.
>>
>> At least this is how I understand it working from looking over the code.
>>
>> Personally I used this add-on ( http://proqual.net/saveasmultiple/)
>> along with thunderbird to save large amounts of messages at a time to
>> .eml files and then move them manually into the corpus, running
>> move2num.pl to fix the file names after.
>>
>> At one point I had it exporting a few hundred messages at a time from a
>> spam/notspam collection I had saved from my previous anti-spam software.
>>
>>> Additionally, if I have this 3:1 ratio, should I get assp to only
>>> collect every 3rd spam?
>>>
>> Collecting frequency applies to both spam and notspam, it would simply
>> slow down your corpus growth.
>>
>> Collecting frequency is only recommended for high volume servers where
>> collecting each message would cause the corpus to be overwritten too
>> fast, it also increases performance because it does not have to write
>> each message to disk.
>>
>> > Once I hit maxfiles, though, it shouldn't
>> > matter, as old mail will just be rotated out, right?
>>
>> You are FAR from hitting maxFiles for your notspam.
>> The maxFiles setting is per folder, so it would take a while for you to
>> reach that limit.
>>
>>> Why such an odd number as 18009 for maxfiles? And why should only 4k of
>>> the message be processed? I set it to collect ~50k so it would get all
>>> of every message. How is that worse?
>>>
>> We use that number for max files because we found that it provided good
>> results. The default values have been tested by this community for quite
>> some time; it is highly recommended to not change them.
>>
>> As for the '4096' default for maxBytes, 4096 happens to be the default
>> cluster size for the NTFS file system (for disks over 2GB). Using that as
>> the max message size helps speed up file system access and reduces
>> fragmentation on Windows; on *nix it's just a good default.
>>
>> > And why should only 4k of the message be processed?
>> > I set it to collect ~50k so it would get all of every message.
>> > How is that worse?
>>
>> Because spam is MUCH wordier than ham. If you took a random sampling of
>> 200 spam messages and 200 ham messages with a max size of 50k per
>> message, the spam will be the larger of the two groups text wise most of
>> the time.
>>
>> Another reason for the default message size and collect limit is that
>> when you run the rebuild script it loads the entire corpus into memory
>> for processing, so 18009 files x 2 (notspam and spam collections) x 4096
>> bytes would be 147529728 bytes or 140MB. This is roughly the max size of
>> the ASSP controlled corpus with default values. Then there are the spam
>> and notspam reports: ASSP does NOT overwrite these at random and they
>> are the only file collection that can grow out of control.
>>
>> With your settings of 'maxfiles' at 10000 and 'maxbytes' at 50000 you
>> would have a corpus of around 953MB. WAY overkill, and the rebuild
>> script would probably grind your system to a halt.
>>
>>
>> Kevin
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by DB2 Express
>> Download DB2 Express C - the FREE version of DB2 express and take
>> control of your XML. No limits. Just data. Click to get it now.
>> http://sourceforge.net/powerbar/db2/
>> _______________________________________________
>> Assp-user mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/assp-user
