Hi Steve,
your point about not writing signatures to the db during training is perfectly clear; in fact I was
thinking about the same thing some time ago.

And yes, I'd appreciate your training script.

My current and still-working installation of the old dspam uses one shared group for all users.

Now I am trying to build a userless setup on a filter machine placed in front of the real imap server(s). My idea is to replace the Bayes filter in SpamAssassin with dspam. Bayes in SA is a black box to me and its performance seems to be really suboptimal.
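
On the SpamAssassin side that swap should only need Bayes switched off in local.cf; the directive below is stock SpamAssassin, while wiring dspam in front of it is a separate glue step:

# local.cf -- turn off SpamAssassin's built-in Bayes classifier,
# leaving the statistical work to dspam running in front of it
use_bayes 0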

In my experience users do not train, which leads to poor results. I have even seen a user who retrained dspam to deliver almost all spam and eat almost all ham. Currently I use SpamAssassin for most users (they do not have to care about it) and dspam (on the imap server) for me and a few experienced users.

The next step would be to leverage the dspam classification maintained by a few experienced users, so that the other users get more accurate results without having to know about dspam at all.


Steve wrote:
Hello Steve,

Ahoi Josef,


I am quite an old user of dspam (I started with 3.6.x) and now I am considering trying 3.9.0 Alpha.

:)


I personally think that dspam should not do any special checking for configuration sanity, but the docs should be updated with sample configs for common scenarios, and some care should be taken to help users choose the best tokenizer/algorithm/storage/training mode for their scenario.

I agree.


Years ago I chose train-on-error, the Graham/Burton algorithm and MySQL storage.

Good choice.


MySQL was the choice for me because I wanted daemon mode and it is thread-safe.
I chose train-on-error because I wanted to keep the token data small.

Some info on this is on the dspam wiki, but it is generally hard to figure out which combination to start with.

I would like to install the new dspam on one of our filter servers, which handles circa 4000 hams and 3000 spams daily. I'd probably use MySQL again, because I know it quite well. I'd try the TEFT training mode; I do not expect a big load on this server.

I am not sure about the tokenizer and algorithm. My preference is to get a very low number of false positives. I think I should start with the OSB or chain tokenizer, Graham+Burton and Markov.

Markov is out of the discussion if you use MySQL. Go for OSB.
Stay with TOE, since internally it will anyway behave like TEFT (and in some
cases TUM) until you reach a certain statistical level.

For the algorithm I would suggest using "graham burton naive" during corpus training and then
just "graham burton" later in production. The reason I suggest "naive" during corpus
training is that it leads to more false positives and false negatives, but that's fine during corpus training.
Later in production "naive" is better removed if you are not using Markov.

For the tokenizer you can go with chain, but then you need to stay on chain,
because switching to OSB later will render your training useless. So start with
OSB from the beginning and stay there; the config sketch below sums this up.
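
As a rough reference, the relevant dspam.conf excerpt for that combination might look like this; the directive names follow the sample dspam.conf, so verify them against the file shipped with 3.9.0:

# dspam.conf -- excerpt with the settings discussed above
TrainingMode toe
Tokenizer osb

# during corpus training:
#Algorithm graham burton naive

# in production (drop "naive"):
Algorithm graham burton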

I have a training script which does double-sided training and TONE (train on
error or near error) training, and it uses a training threshold. It is much
faster than the original training script because it does NOT write anything to
the signature table. I avoid that because in corpus training I don't need DSPAM
to write there, since I have the mail available in raw format anyway.
Double-sided training is a method to get DSPAM to learn faster what spam/ham is by
unlearning a message from the other class in case of errors. For example: assume you have
a mail containing the words: "Buy cheap Viagra here http://www.domain.tld/"

Now, should DSPAM say that this mail is INNOCENT, then I have a False Negative,
and my script goes on and (a code sketch follows below):
1) learns the message as SPAM
2) unlearns the tokens from INNOCENT

It does the same with a False Positive, but in the case of an FP it does:
1) learns the message as INNOCENT
2) unlearns the tokens from SPAM
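
A minimal sketch of that double-sided step, written here in Python around the dspam command line client (the retrain() name is made up, and the assumption that --source=error on a full raw message performs both the learn and the unlearn should be verified against dspam(1)):

import subprocess

def retrain(user, message, correct_class):
    """Double-sided retraining of one misclassified raw message.

    Feeding the full message back with --class=<correct_class> and
    --source=error is assumed to learn it into the correct class and
    back the tokens out of the wrong one, i.e. steps 1) and 2) above.
    correct_class is "spam" or "innocent"; message is raw bytes.
    """
    subprocess.run(
        ["dspam", "--user", user,
         "--class=" + correct_class,
         "--source=error"],
        input=message, check=True)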

The TONE training is done by defining a threshold. I use a symmetric threshold.
To illustrate what I mean, take an example with a threshold of 40 (i.e. a
confidence of 0.40). Now assume you have a message SPAM001 and DSPAM says that
it is indeed spam, but with a confidence of 0.15 (aka 15%); my training script
is then going to LEARN that message anyway, because 0.15 is smaller than my
threshold of 0.40. And the same is done with a HAM message.
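
Continuing the same hypothetical Python wrapper, the TONE decision roughly looks like this; the X-DSPAM-Result line that --classify prints, and its confidence= field, are assumptions to check against your build:

import re
import subprocess

# the confidence= field is assumed from the X-DSPAM-Result summary line
CONF_RE = re.compile(r'confidence=([0-9.]+)')

def needs_training(user, message, expected_class, threshold=0.40):
    """Train on error OR near error: retrain when dspam gets the class
    wrong, or gets it right with confidence below the symmetric threshold."""
    out = subprocess.run(
        ["dspam", "--user", user, "--classify"],
        input=message, capture_output=True, check=True,
    ).stdout.decode(errors="replace")
    got = "spam" if 'class="Spam"' in out else "innocent"
    m = CONF_RE.search(out)
    confidence = float(m.group(1)) if m else 0.0
    return got != expected_class or confidence < threshold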

That way of training helps me to keep the data very small. Some weeks ago I did
a training for a new setup where I processed around 2 to 3 million mails (the
corpus is pretty well balanced, with 50% spam and 50% ham) and the result after
training was:
                TP True Positives:                     0
                TN True Negatives:                     0
                FP False Positives:                    0
                FN False Negatives:                    0
                SC Spam Corpusfed:                  5539
                NC Nonspam Corpusfed:               9908
                TL Training Left:                      0
                SHR Spam Hit Rate                100.00%
                HSR Ham Strike Rate:             100.00%
                PPV Positive predictive value:   100.00%
                OCA Overall Accuracy:            100.00%

Inside MySQL the statistical data for that user looks like this:
mysql> select * from dspam_stats where uid=2\G
*************************** 1. row ***************************
                   uid: 2
          spam_learned: 1615
      innocent_learned: 4369
    spam_misclassified: 0
innocent_misclassified: 0
        spam_corpusfed: 5539
    innocent_corpusfed: 9908
       spam_classified: 0
   innocent_classified: 0
1 row in set (0.00 sec)

mysql>

So that's below 10'000 mails actually trained for 2 to 3 million corpus mails:
below 0.5% of all corpus mails needed training. As you can see, real training
happened for just 1'615 spam mails and 4'369 innocent mails. But I have set up
a forced training loop to train an FP/FN up to 5 times until it gets the right
class. So if a message results in an FP or FN, I train it up to 5 times (the
number is configurable on the command line when calling the training script)
until it is correctly classified, but not more than 5 times. That's like
TestConditionalTraining, but done by hand, since I like to control the
retraining frequency myself rather than let TestConditionalTraining do the
loop internally.
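
That loop, again as a rough Python sketch on top of retrain() and needs_training() from above (train_until_learned and max_rounds are made-up names):

def train_until_learned(user, message, correct_class, max_rounds=5):
    """Retrain an FP/FN at most max_rounds times (the configurable limit
    described above), stopping as soon as dspam gets the class right."""
    for _ in range(max_rounds):
        retrain(user, message, correct_class)
        # threshold 0.0 turns needs_training() into a pure class check
        if not needs_training(user, message, correct_class, threshold=0.0):
            break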

The statistical output above is misleading: it says 100.00% accuracy, but the
training did not really reach 100% accuracy, because the SC and NC values tell
me that there were errors which resulted in training. The 100% comes from the
way I train.

I can share the training script if you need it.

The setup you are planning to build: is it going to be just for you, or for a
bunch of other people too? If so, have you considered using groups in DSPAM?
That would help your other users reach a certain accuracy faster than without
groups.
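
For reference, groups live one per line in the group file in dspam's home directory; the name:type:members layout below follows the stock documentation, while the group name, type and users are only illustrative:

# $DSPAM_HOME/group -- format assumed: groupname:type:user,user,...
# a shared group lets the listed users work on one common dataset
experienced:shared:josef,user2,user3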


--

Cheers,

Steve



--

Best regards,
Josef Liška

CHL | system care

Phone: +420.272048055
Fax: +420.272048064
Mobile: +420.776026526 daily 9:00 - 17:30
Jabber: [email protected]
https://www.chl.cz/

