Re: [Dspam-user] Re-training

Stevan Bajić Tue, 10 Aug 2010 14:21:14 -0700

On Tue, 10 Aug 2010 22:24:43 +0200
Julien Valroff <jul...@kirya.net> wrote:

> On Monday 09 August 2010 at 21:10 +0200, Stevan Bajić wrote:
> > On Mon, 09 Aug 2010 20:55:55 +0200
> > Julien Valroff <jul...@kirya.net> wrote:
> [...]
> > > > Well.... what should I say? Keeping track of the state of a signature
> > > > is fine and dandy but what do you do if someone is not using
> > > > signatures but training the whole message? How do you keep track of
> > > > that?
> > > 
> > > Is that even possible? If so, I didn't know...
> > > 
> > Yes. This is possible. I run mainly without signatures at all.
> 
> May I ask you how you do this
>
You mean how I have set that up to work without signatures? It all started years
ago by setting "TrainPristine" to "on" and has since evolved into
something much bigger than expected.

> and what is the reason of not using the
> signature system?
>
Enemy number one is: The database :)
My reason for saying this is simple:
Let's make some calculations. I will use simple numbers just for illustration.
Okay?
* Let's assume I process 1'000'000 mails a day.
* Let's assume my FP and FN rate is at 5%.
* Let's assume average mail size is 100KB.
* Let's assume average word count is 5'000.
* Let's assume average size of a decomposed mail is 50KB.

Now do the math:

Saving all those mails in the signature database will lead to:
1'000'000 * 50KB = roughly 47GB

And out of those 47GB I will only need 5%, or about 2.35GB, for the training
(users doing FP/FN processing).

So by not using signatures I save +/- 45GB of data per day that does not need
to be saved in the database.
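
The arithmetic above can be re-checked with a few lines of Python. This is only a sketch using the illustrative figures from the list, and it assumes binary units (1 GB = 1024 * 1024 KB), which is what makes 1'000'000 * 50KB land near 47GB. The function and constant names are made up for this example.

```python
# Illustrative figures taken from the list above; they are not measurements.
MAILS_PER_DAY = 1_000_000
ERROR_RATE = 0.05        # combined FP/FN rate
DECOMPOSED_KB = 50       # average size of a decomposed mail, in KB

# Assumption: binary units, 1 GB = 1024 * 1024 KB.
KB_PER_GB = 1024 * 1024


def daily_signature_storage_gb(mails=MAILS_PER_DAY, size_kb=DECOMPOSED_KB):
    """Data written per day if every decomposed mail is stored as a signature."""
    return mails * size_kb / KB_PER_GB


def daily_retraining_gb(mails=MAILS_PER_DAY, size_kb=DECOMPOSED_KB,
                        error_rate=ERROR_RATE):
    """Part of that data which is actually read again for FP/FN retraining."""
    return daily_signature_storage_gb(mails, size_kb) * error_rate


stored = daily_signature_storage_gb()
needed = daily_retraining_gb()
print(f"stored per day:       {stored:.2f} GB")
print(f"needed for training:  {needed:.2f} GB")
print(f"never read again:     {stored - needed:.2f} GB")
```

The exact results are slightly above the rounded 47GB and 2.35GB used in the text, but the conclusion is the same: about 45GB per day is stored and never read again.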

On the other hand I need to send 2.35GB more data over the wire (worst case)
because the whole message needs to be sent to DSPAM in order
to do the training. I avoid that traffic by offering other methods
for retraining. One way is to read the data directly from
the mail server where the message in question is sitting (using IMAP and
other techniques).
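
As a rough illustration of the IMAP approach (not the actual setup described here), the sketch below fetches one message by UID and pipes it to the dspam command line client for retraining. The host, credentials, mailbox and UID are hypothetical parameters, and it assumes a dspam binary on the PATH that accepts the usual --user, --class and --source=error options; check them against your own installation.

```python
import imaplib
import subprocess


def build_retrain_command(user, kind):
    """Return the argv for retraining a misclassified message.

    Assumes the dspam CLI options --user, --class and --source=error.
    """
    if kind not in ("spam", "innocent"):
        raise ValueError("kind must be 'spam' or 'innocent'")
    return ["dspam", "--user", user, f"--class={kind}", "--source=error"]


def fetch_raw_message(conn, mailbox, uid):
    """Fetch one full message by UID without marking it as read."""
    status, _ = conn.select(mailbox, readonly=True)
    if status != "OK":
        raise RuntimeError(f"cannot select mailbox {mailbox!r}")
    status, data = conn.uid("FETCH", uid, "(BODY.PEEK[])")
    if status != "OK" or not data or not isinstance(data[0], tuple):
        raise RuntimeError(f"cannot fetch UID {uid} from {mailbox!r}")
    return data[0][1]  # raw message bytes


def retrain_from_imap(host, login, password, mailbox, uid, dspam_user, kind):
    """Read the message where it already sits and hand it to dspam on stdin."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(login, password)
        raw = fetch_raw_message(conn, mailbox, uid)
    finally:
        conn.logout()
    subprocess.run(build_retrain_command(dspam_user, kind), input=raw, check=True)
```

A call such as retrain_from_imap("imap.example.org", "alice", "secret", "Junk", "42", "alice", "innocent") would then retrain a false positive. If the retraining host sits next to the mail store, the message never has to be sent by the user at all.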

I don't lose speed when processing a FP/FN because, regardless of where the
message is sitting, I need to decompose it twice anyway (once when classifying
and once when retraining). However, I save the additional query needed to
read the data from the dspam_signature_data table. That is pretty much the same
as when using the Hash driver (which writes those .sig files to the user's home
instead of filling up a database).

All this combined allows me to have faster filtering while offering
almost the same functionality as with signatures, and with much lower
storage requirements. Almost the same, and not the same, because using signatures
is more comfortable. But to be honest: comfortable for the one implementing the
solution, and this is usually the admin. And since I can code and can glue
things together... why not invest some time to build something that is
comfortable enough AND helps me to scale better and process more with the
exact same hardware?

> Just out of curiosity...
> 
> Cheers,
> Julien 
> 
> -- 
> Julien Valroff <jul...@kirya.net>
> http://www.kirya.net
> GPG key: 4096R/290D20C5 
> 092F 4CB5 5F19 E006 1CFD  B489 D32B 8D66 290D 20C5
> 
> 
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by 
> 
> Make an app they can't live without
> Enter the BlackBerry Developer Challenge
> http://p.sf.net/sfu/RIM-dev2dev 
> _______________________________________________
> Dspam-user mailing list
> Dspam-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspam-user
