Hi Steve,
your point on not writing signatures to the DB during training is
perfectly clear; in fact, I was thinking about the same thing a while
ago.
And yes, I'd appreciate your training script.
My current, still-working installation of the old DSPAM uses one shared
group for all users.
Now I am trying to build a userless setup on a filter machine placed in
front of the real IMAP server(s).
My idea is to replace the Bayes filter in SpamAssassin with DSPAM. Bayes
in SA is a black box for me, and its performance seems really suboptimal.
In my experience users do not train, which leads to poor results. I have
even seen a user who retrained DSPAM to deliver almost all spam and eat
almost all ham.
Currently I use SpamAssassin for most users (they do not have to care
about it) and DSPAM (on the IMAP server) for me and a few experienced
users.
The next step would be to leverage the DSPAM classification maintained
by a few experienced users, so that the other users get more accurate
results without having to know about DSPAM at all.
Steve wrote:
Hello Steve,
Ahoi Josef,
I am quite an old user of DSPAM (I started with 3.6.x) and now I am
considering trying 3.9.0 Alpha.
:)
I personally think that DSPAM should not do any special checking for
configuration sanity, but the docs should be updated with sample configs
for common scenarios, and some care should be taken to help users choose
the best tokenizer/algorithm/storage/training mode for their scenario.
I agree.
Years ago I chose train-on-error, the Graham/Burton algorithm and MySQL
storage.
Good choice.
MySQL was the choice for me because I wanted daemon mode and it is
thread safe.
I chose train-on-error because I wanted to keep the token data small.
Some info on this is on the DSPAM wiki, but it is generally hard to find
out which combination to start with.
I would like to install the new DSPAM on one of our filter servers,
which handles about 4,000 hams and 3,000 spams daily.
I'd probably use MySQL again, because I know it quite well. I'd try the
TEFT training mode; I do not expect a big load on this server.
I am not sure about the tokenizer and the algorithm. My preference is a
very low number of false positives.
I think I should start with the OSB or chain tokenizer, Graham+Burton
and Markov.
Markov is out of the question if you use MySQL. Go for OSB.
Stay with TOE, since internally it will behave like TEFT anyway, and in
some cases like TUM, until you reach a certain statistical level.
For the algorithm I would suggest using "graham burton naive" during
corpus training and then just "graham burton" in production. The reason
I suggest "naive" during corpus training is that it leads to more false
positives and false negatives, but that's fine during corpus training.
In production, "naive" is better removed unless you are using Markov.
For the tokenizer you can go with chain, but then you need to stay on
chain, because switching to OSB later will render your training useless.
So start with OSB from the beginning and stay there.
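Put together, the advice above would translate into roughly these
dspam.conf lines. This is a sketch from memory of the stock dspam.conf
directives; verify the names against the example config shipped with
3.9.0 before relying on it:

```
# Corpus training phase; drop "naive" when moving to production.
Algorithm graham burton naive

# Start with OSB and stay there; switching tokenizers discards training.
Tokenizer osb

# TOE behaves like TEFT/TUM internally until enough data is gathered.
TrainingMode toe
```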
I have a training script that does double-sided training and TONE (train
on error or near error) training, and it uses a training threshold. It
is much faster than the original training script because it does NOT
write anything to the signature table. I avoid that because during
corpus training I don't need DSPAM to write there, since I have the mail
available in raw format anyway.
Double-sided training is a method to get DSPAM to learn faster what
spam/ham is by unlearning a message from the other class in case of
errors. For example, assume you have a mail containing the words:
"Buy cheep Viagra here http://www.domain.tld/"
Now, should DSPAM say that this mail is INNOCENT, then I have a false
negative and my script goes on and:
1) learns the message as SPAM
2) unlearns the tokens from INNOCENT
It does the same with a false positive, but in the case of an FP it:
1) learns the message as INNOCENT
2) unlearns the tokens from SPAM
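To make the two cases concrete, here is a small self-contained Python
sketch of the double-sided correction step. It uses a toy in-memory
token table in place of DSPAM's storage, and the function names are
mine, not DSPAM's:

```python
from collections import defaultdict

# Toy per-class token counts standing in for DSPAM's token table.
counts = {"spam": defaultdict(int), "innocent": defaultdict(int)}

def learn(klass, tokens):
    for t in tokens:
        counts[klass][t] += 1

def unlearn(klass, tokens):
    for t in tokens:
        counts[klass][t] = max(0, counts[klass][t] - 1)

def correct_error(true_class, tokens):
    """Double-sided correction: learn the true class and unlearn
    the same tokens from the opposite class."""
    other = "innocent" if true_class == "spam" else "spam"
    learn(true_class, tokens)
    unlearn(other, tokens)

# False negative: the Viagra mail was wrongly filed as innocent.
tokens = ["buy", "cheep", "viagra", "here"]
learn("innocent", tokens)       # the initial misclassification
correct_error("spam", tokens)   # step 1 + step 2 from above
```

After the correction the tokens count once toward SPAM and no longer
toward INNOCENT, which is exactly the two-step sequence described above.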
TONE training is done by defining a threshold; I use a symmetric one. To
illustrate what I mean, here is an example with a threshold of 40.
Assume you have a message SPAM001 and DSPAM says that it is indeed spam,
but with a confidence of only 0.15 (i.e. 15%). My training script is
then going to LEARN that message anyway, because 0.15 is smaller than my
threshold of 40 (i.e. 0.40). The same is done with a ham message.
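The decision rule is simple enough to state in a few lines of Python.
This is just an illustration of the rule, not code from the actual
script:

```python
def should_train(correct, confidence, threshold=0.40):
    """TONE: train on an error, or on a correct result whose
    confidence falls below the threshold (a "near error")."""
    return (not correct) or (confidence < threshold)

# SPAM001: classified correctly, but only with 15% confidence,
# so it gets trained anyway.
print(should_train(correct=True, confidence=0.15))   # True
print(should_train(correct=True, confidence=0.95))   # False: confident hit
print(should_train(correct=False, confidence=0.99))  # True: always train errors
```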
That way of training helps me keep the data very small. Just a few
weeks ago I did a training run for a new setup where I processed around
2 to 3 million mails (the corpus is pretty well balanced, 50% spam and
50% ham), and the result after training was:
TP True Positives: 0
TN True Negatives: 0
FP False Positives: 0
FN False Negatives: 0
SC Spam Corpusfed: 5539
NC Nonspam Corpusfed: 9908
TL Training Left: 0
SHR Spam Hit Rate: 100.00%
HSR Ham Strike Rate: 100.00%
PPV Positive predictive value: 100.00%
OCA Overall Accuracy: 100.00%
Inside MySQL, the statistical data for that user looks like this:
mysql> select * from dspam_stats where uid=2\G
*************************** 1. row ***************************
uid: 2
spam_learned: 1615
innocent_learned: 4369
spam_misclassified: 0
innocent_misclassified: 0
spam_corpusfed: 5539
innocent_corpusfed: 9908
spam_classified: 0
innocent_classified: 0
1 row in set (0.00 sec)
mysql>
So that's below 10'000 mails trained for 2 to 3 million corpus mails,
i.e. below 0.5% of all corpus mails needed to be trained. As you can
see, real training happened for just 1'615 spam mails and 4'369 innocent
mails. But I have set up a forced training loop that retrains an FP/FN
up to 5 times until it gets the right class. So if a message results in
an FP or FN, I train it repeatedly (the number is configurable on the
command line when calling the training script) until it is correctly
classified, but not more than 5 times. That's like
TestConditionalTraining, but done by hand, since I like to control the
retraining frequency myself rather than let TestConditionalTraining do
the loop internally.
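The forced loop can be sketched like this. Again a schematic in Python:
the real script drives the dspam binary, and the classifier here is a
stand-in of my own invention:

```python
def train_until_correct(classify, learn, true_class, message, max_rounds=5):
    """Retrain a misclassified message until the classifier gets it
    right, but at most max_rounds times. Returns the number of
    training passes actually performed."""
    passes = 0
    while passes < max_rounds and classify(message) != true_class:
        learn(true_class, message)
        passes += 1
    return passes

# Stand-in classifier that starts answering "spam" after two trainings.
state = {"trained": 0}
classify = lambda m: "spam" if state["trained"] >= 2 else "innocent"
learn = lambda c, m: state.update(trained=state["trained"] + 1)

print(train_until_correct(classify, learn, "spam", "SPAM001"))  # 2
```

The cap matters: a message the classifier never accepts stops consuming
training passes after max_rounds instead of looping forever.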
The statistical output above is misleading: it says 100.00% accuracy,
but the training did not really reach 100% accuracy, because the SC and
NC counters tell me that there were errors which resulted in training.
The 100% comes from the way I train.
I can share the training script if you should need it.
The setup you are planning to build: is it going to be just for you, or
for a bunch of other people too? If so, have you considered using groups
in DSPAM? That would help your other users reach a certain accuracy
faster than without groups.
--
Cheers,
Steve
--
Best regards,
Josef Liška
CHL | system care
Phone: +420.272048055
Fax: +420.272048064
Mobile: +420.776026526 (daily 9:00 - 17:30)
Jabber: [email protected]
https://www.chl.cz/
------------------------------------------------------------------------------
_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user