On Tue, 10 Aug 2010 22:24:43 +0200 Julien Valroff <jul...@kirya.net> wrote:
> On Monday, 09 August 2010 at 21:10 +0200, Stevan Bajić wrote:
> > On Mon, 09 Aug 2010 20:55:55 +0200
> > Julien Valroff <jul...@kirya.net> wrote:
> [...]
> > > > Well.... what should I say? Keeping track of the state of a signature
> > > > is fine and dandy but what do you do if someone is not using
> > > > signatures but training the whole message? How do you keep track of
> > > > that?
> > >
> > > Is that even possible? If so, I didn't know...
> > >
> > Yes. This is possible. I run mainly without signatures at all.
>
> May I ask you how you do this
>
You mean how I have set this up to work without signatures? It all started
years ago by setting "TrainPristine" to "on" and has since evolved into
something much bigger than expected.

> and what is the reason of not using the
> signature system?
>
Enemy number one is: The database :)

My reason for saying this is simple. Let's make some calculations. I will
use simple numbers just for illustration. Okay?

* Let's assume I process 1'000'000 mails a day.
* Let's assume my FP and FN rate is 5%.
* Let's assume the average mail size is 100KB.
* Let's assume the average word count is 5'000.
* Let's assume the average size of a decomposed mail is 50KB.

Now do the math. Saving all those mails in the signature database will
lead to:

  1'000'000 * 50KB ≈ 47GB

Out of those 47GB I will only need 2.35GB (5%) for the training (users
doing FP/FN processing). So by not using signatures I save roughly 45GB of
data per day that does not need to be stored in the database.

On the other hand, I need to send 2.35GB more data over the wire (worst
case), because the whole message needs to be sent to DSPAM in order to do
the training. I avoid that traffic by offering other methods for
retraining. One way of doing this is to read the data directly from the
mail server where the message in question is sitting (that is, using IMAP
and other techniques).
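The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not anything DSPAM ships: the constant and function names are made up for illustration, and it assumes binary units (1 GB = 1024 * 1024 KB), which is what gives 47 and not 50. The figures in the text are rounded down slightly.

```python
# Back-of-the-envelope estimate of daily signature storage.
# All names here are hypothetical; the inputs are the illustrative
# numbers from the text, not measurements.

MAILS_PER_DAY = 1_000_000   # mails processed per day
ERROR_RATE = 0.05           # share of mails retrained as FP/FN
DECOMPOSED_KB = 50          # average size of a decomposed mail, in KB

KB_PER_GB = 1024 * 1024     # assumption: binary units


def signature_storage_gb(mails: int, size_kb: float) -> float:
    """Storage needed if every decomposed mail is kept as a signature."""
    return mails * size_kb / KB_PER_GB


total_gb = signature_storage_gb(MAILS_PER_DAY, DECOMPOSED_KB)
retrain_gb = total_gb * ERROR_RATE   # data actually needed for retraining
saved_gb = total_gb - retrain_gb     # data that never had to be stored

print(f"stored with signatures: {total_gb:.2f} GB/day")
print(f"needed for retraining:  {retrain_gb:.2f} GB/day")
print(f"saved without them:     {saved_gb:.2f} GB/day")
```

Changing ERROR_RATE or DECOMPOSED_KB shows how quickly the stored volume grows compared with the small share that is ever read back for retraining.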
I don't lose speed when processing a FP/FN because, regardless of where
the message is sitting, I need to decompose it twice anyway (once when
classifying and once when retraining). However, I save the additional
query needed to read the data from the dspam_signature_data table. This is
pretty much the same as when using the Hash driver (which writes those
.sig files to the user's home instead of filling up a database).

All of this combined allows me to filter faster while offering almost the
same functionality as with signatures, and with much lower storage
requirements. Almost the same and not the same, because using signatures
is more comfortable. But to be honest: comfortable for the one
implementing the solution, and that is usually the admin. And since I can
code and can glue things together... why not invest some time to build
something that is comfortable enough AND helps me scale better and process
more on exactly the same hardware?

> Just out of curiosity...
>
> Cheers,
> Julien

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user