> Hello Steve,

Ahoi Josef,
> I am quite an old user of dspam (I started with 3.6.x) and now I am
> considering trying 3.9.0 Alpha.

:)

> I personally think that dspam should not do any special checking for
> configuration sanity, but the docs should be updated with sample
> configs for common scenarios, and some care should be taken to help
> users choose the best tokenizer/algorithm/storage/training mode for
> their scenario.

I agree.

> I have chosen years ago train on error, algorithm graham burton and
> MySQL storage.

Good choice.

> MySQL was a choice for me because I wanted daemon mode and it is
> thread safe.
> I have chosen train on error because I wanted to keep the token data
> small.
>
> Some info on this is on the dspam wiki, but it is generally hard to
> find out which combination to start with.
>
> I would like to install the new dspam on one of our filter servers,
> which handles ca. 4000 hams and 3000 spams daily.
> I'd probably use MySQL again, because I know it quite well. I'd try
> the TEFT training mode; I do not expect big load on this server.
>
> I am not sure about the tokenizer and algorithm. My preference is to
> get a very low number of false positives.
> I think I should try to start with the OSB or chain tokenizer,
> graham+burton and markov.

Markov is out of the discussion if you use MySQL. Go for OSB.

Stay with TOE, since internally it will behave like TEFT (and in some cases TUM) anyway until you reach a certain statistical level.

For the algorithm I would suggest using "graham burton naive" during corpus training, and then just "graham burton" later in production. The reason I suggest "naive" during corpus training is that "naive" will lead to more false positives and false negatives, but that's fine during corpus training. Later, in production, "naive" is better removed if you are not using Markov.

For the tokenizer you can go with chain, but then you need to stay on chain, because switching later to OSB will render your training useless.
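To make that concrete, the relevant dspam.conf lines for these recommendations would look roughly like the fragment below. The directive names are as I remember them from the stock 3.9 dspam.conf; double-check them against the file shipped with your version:

```
# During corpus training (naive is tolerable here):
#Algorithm graham burton naive

# In production (drop naive when not using Markov):
Algorithm graham burton

Tokenizer osb
TrainingMode toe
```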
So start from the beginning with OSB and stay there.

I have a training script which does double-sided training and TONE (train on error or near error) training, and it uses a training threshold. It is much faster than the original training script because it does NOT write anything to the signature table. I avoid that because in corpus training I don't need DSPAM to write there, since I have the mail available in raw format anyway.

Double-sided training is a method to get DSPAM to learn faster what spam/ham is, by unlearning a message from the other class in case of errors. For example, assume you have a mail containing the words: "Buy cheep Viagra here http://www.domain.tld/" Should DSPAM now say that this mail is INNOCENT, then I have a false negative and my script goes on and:
1) learns the message as SPAM
2) unlearns the tokens from INNOCENT
It does the same with a false positive, but in the case of a FP it:
1) learns the message as INNOCENT
2) unlearns the tokens from SPAM

The TONE training is done by defining a threshold. I use a symmetric threshold. To illustrate what I mean, take a threshold of 40 as an example. Now assume you have a message SPAM001, and assume that DSPAM says it is indeed spam, but with a confidence of 0.15 (aka 15%). My training script is going to LEARN that message anyway, because 0.15 is smaller than my threshold of 40 (aka 0.40). The same is done with HAM messages. That way of training helps me to keep the amount of data very low.
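In pseudocode, the decision my script makes boils down to something like the sketch below. The function names and class labels are illustrative, not DSPAM's actual code; they just restate the TONE rule and the double-sided corrective step:

```python
def needs_training(predicted, actual, confidence, threshold=0.40):
    """TONE: train on error, or on near error (correct class but
    confidence below the symmetric threshold)."""
    if predicted != actual:
        return True                    # classic train-on-error
    return confidence < threshold      # near error: right class, low confidence


def corrective_actions(actual):
    """Double-sided training: learn the true class AND unlearn the
    tokens from the opposite class."""
    other = "innocent" if actual == "spam" else "spam"
    return [("learn", actual), ("unlearn", other)]


# Correctly classified as spam, but only 15% confident -> still trained:
print(needs_training("spam", "spam", 0.15))    # True
# Correct and confident -> left alone:
print(needs_training("innocent", "innocent", 0.80))    # False
# A false negative triggers learn-as-spam plus unlearn-from-innocent:
print(corrective_actions("spam"))
```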
I just did a training some weeks ago for a new setup where I processed around 2 to 3 million mails (the corpus is pretty well balanced, with 50% spam and 50% ham) and the result after training was:

TP True Positives:                      0
TN True Negatives:                      0
FP False Positives:                     0
FN False Negatives:                     0
SC Spam Corpusfed:                   5539
NC Nonspam Corpusfed:                9908
TL Training Left:                       0
SHR Spam Hit Rate:                100.00%
HSR Ham Strike Rate:              100.00%
PPV Positive predictive value:    100.00%
OCA Overall Accuracy:             100.00%

Inside MySQL the statistical data for that user looks like this:

mysql> select * from dspam_stats where uid=2\G
*************************** 1. row ***************************
                   uid: 2
          spam_learned: 1615
      innocent_learned: 4369
   spam_misclassified: 0
innocent_misclassified: 0
        spam_corpusfed: 5539
    innocent_corpusfed: 9908
       spam_classified: 0
   innocent_classified: 0
1 row in set (0.00 sec)

mysql>

So that's below 10'000 mails trained for 2 to 3 million corpus mails. That's below 0.5% of all corpus mails that needed to be trained. As you can see, real training was needed for just 1'615 spam mails and 4'369 innocent mails.

But I have set up a forced training loop to train a FP/FN up to 5 times until it gets the right class. So if a message results in a FP or FN, then I train the message repeatedly (the number is configurable on the command line when calling the training script) until the message is correctly classified, but not more than 5 times. That's like TestConditionalTraining, but done by hand, since I like to control the frequency of retraining myself and not let TestConditionalTraining do the loop internally.

The statistical output above is misleading: it says 100.00% accuracy, but it is not 100% accurate. I mean the training did not reach 100% accuracy, because the SC and NC numbers tell me that there were errors which resulted in training. The 100% comes from the way I train. I can share the training script if you need it.
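The forced-training loop can be sketched like this. The classifier and trainer below are stand-ins so the control flow is visible; in the real script those calls go to DSPAM, and max_rounds is the command-line-configurable limit:

```python
def force_train(message, actual, classify, train, max_rounds=5):
    """Retrain a misclassified message until classify() agrees with the
    correct class, giving up after max_rounds attempts.
    Returns the number of training rounds used."""
    for attempt in range(1, max_rounds + 1):
        train(message, actual)              # learn as the correct class
        if classify(message) == actual:     # stop as soon as it sticks
            return attempt
    return max_rounds                       # gave up after max_rounds


# Stand-in classifier that needs 3 trainings before the class "sticks":
state = {"trained": 0}
train = lambda msg, cls: state.__setitem__("trained", state["trained"] + 1)
classify = lambda msg: "spam" if state["trained"] >= 3 else "innocent"

print(force_train("msg-001", "spam", classify, train))   # 3
```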
The setup you are planning to build: is it going to be just for you, or for a bunch of other people too? If so: have you considered using groups in DSPAM? That would help your other users reach a certain accuracy faster than without groups.

> --
>
> Best regards,
> Cheers,
> Josef Liška

Steve

------------------------------------------------------------------------------
_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
