> Hello Steve,

Ahoi Josef,
> I am quite an old user of dspam (I started with 3.6.x) and now I am
> considering trying 3.9.0 Alpha.

:)

> I personally think that dspam should not do any special checking for
> configuration sanity, but the docs should be updated with sample
> configs for common scenarios, and some care should be taken to help
> users choose the best tokenizer/algorithm/storage/training mode for
> their scenario.

I agree.

> I have chosen years ago train on error, algorithm graham burton and
> MySQL storage.

Good choice.

> MySQL was a choice for me because I wanted daemon mode and it is
> thread safe.
> I have chosen train on error because I wanted to keep the token data
> small.
>
> Some info on this is on the dspam wiki, but it is generally hard to
> find out which combination to start with.
>
> I would like to install the new dspam on one of our filter servers,
> which handles ca. 4000 hams and 3000 spams daily.
> I'd probably use MySQL again, because I know it quite well. I'd try
> the TEFT training mode; I do not expect big load on this server.
>
> I am not sure about the tokenizer and algorithm. My preference is to
> get a very low number of false positives.
> I think I should try to start with the OSB or chain tokenizer,
> graham+burton and markov.

Markov is out of the discussion if you use MySQL. Go for OSB.

Stay with TOE, since internally it will behave like TEFT (and in some cases TUM) anyway until you reach a certain statistical level.

For the algorithm I would suggest using "graham burton naive" during corpus training, and then just "graham burton" later in production. The reason I suggest "naive" during corpus training is that "naive" will lead to more false positives and false negatives, but that's fine during corpus training. Later, in production, "naive" is better removed if you are not using Markov.

For the tokenizer you can go with chain, but then you need to stay on chain, because switching later to OSB will render your training useless.
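To make that concrete, the relevant dspam.conf lines for these recommendations would look roughly like the fragment below. The directive names are as I remember them from the stock 3.9 dspam.conf; double-check them against the file shipped with your version:

```
# During corpus training (naive is tolerable here):
#Algorithm graham burton naive

# In production (drop naive when not using Markov):
Algorithm graham burton

Tokenizer osb
TrainingMode toe
```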
So start from the beginning with OSB and stay there.

I have a training script which does double-sided training and TONE (train on error or near error) training, and it uses a training threshold. It is much faster than the original training script because it does NOT write anything to the signature table. I avoid that because in corpus training I don't need DSPAM to write there, since I have the mail available in raw format anyway.

Double-sided training is a method to get DSPAM to learn faster what spam/ham is, by unlearning a message from the other class in case of errors. For example, assume you have a mail containing the words: "Buy cheep Viagra here http://www.domain.tld/" Should DSPAM now say that this mail is INNOCENT, then I have a false negative and my script goes on and:
1) learns the message as SPAM
2) unlearns the tokens from INNOCENT
It does the same with a false positive, but in the case of a FP it:
1) learns the message as INNOCENT
2) unlearns the tokens from SPAM

The TONE training is done by defining a threshold. I use a symmetric threshold. To illustrate what I mean, take a threshold of 40 as an example. Now assume you have a message SPAM001, and assume that DSPAM says it is indeed spam, but with a confidence of 0.15 (aka 15%). My training script is going to LEARN that message anyway, because 0.15 is smaller than my threshold of 40 (aka 0.40). The same is done with HAM messages. That way of training helps me to keep the amount of data very low.
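In pseudocode, the decision my script makes boils down to something like the sketch below. The function names and class labels are illustrative, not DSPAM's actual code; they just restate the TONE rule and the double-sided corrective step:

```python
def needs_training(predicted, actual, confidence, threshold=0.40):
    """TONE: train on error, or on near error (correct class but
    confidence below the symmetric threshold)."""
    if predicted != actual:
        return True                    # classic train-on-error
    return confidence < threshold      # near error: right class, low confidence


def corrective_actions(actual):
    """Double-sided training: learn the true class AND unlearn the
    tokens from the opposite class."""
    other = "innocent" if actual == "spam" else "spam"
    return [("learn", actual), ("unlearn", other)]


# Correctly classified as spam, but only 15% confident -> still trained:
print(needs_training("spam", "spam", 0.15))    # True
# Correct and confident -> left alone:
print(needs_training("innocent", "innocent", 0.80))    # False
# A false negative triggers learn-as-spam plus unlearn-from-innocent:
print(corrective_actions("spam"))
```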
I just did a training some weeks ago for a new setup where I processed around 2 to 3 million mails (the corpus is pretty well balanced, with 50% spam and 50% ham) and the result after training was:

TP True Positives:                      0
TN True Negatives:                      0
FP False Positives:                     0
FN False Negatives:                     0
SC Spam Corpusfed:                   5539
NC Nonspam Corpusfed:                9908
TL Training Left:                       0
SHR Spam Hit Rate:                100.00%
HSR Ham Strike Rate:              100.00%
PPV Positive predictive value:    100.00%
OCA Overall Accuracy:             100.00%

Inside MySQL the statistical data for that user looks like this:

mysql> select * from dspam_stats where uid=2\G
*************************** 1. row ***************************
                   uid: 2
          spam_learned: 1615
      innocent_learned: 4369
   spam_misclassified: 0
innocent_misclassified: 0
        spam_corpusfed: 5539
    innocent_corpusfed: 9908
       spam_classified: 0
   innocent_classified: 0
1 row in set (0.00 sec)

mysql>

So that's below 10'000 mails trained for 2 to 3 million corpus mails. That's below 0.5% of all corpus mails that needed to be trained. As you can see, real training was needed for just 1'615 spam mails and 4'369 innocent mails.

But I have set up a forced training loop to train a FP/FN up to 5 times until it gets the right class. So if a message results in a FP or FN, then I train the message repeatedly (the number is configurable on the command line when calling the training script) until the message is correctly classified, but not more than 5 times. That's like TestConditionalTraining, but done by hand, since I like to control the frequency of retraining myself and not let TestConditionalTraining do the loop internally.

The statistical output above is misleading: it says 100.00% accuracy, but it is not 100% accurate. I mean the training did not reach 100% accuracy, because the SC and NC numbers tell me that there were errors which resulted in training. The 100% comes from the way I train. I can share the training script if you need it.
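The forced-training loop can be sketched like this. The classifier and trainer below are stand-ins so the control flow is visible; in the real script those calls go to DSPAM, and max_rounds is the command-line-configurable limit:

```python
def force_train(message, actual, classify, train, max_rounds=5):
    """Retrain a misclassified message until classify() agrees with the
    correct class, giving up after max_rounds attempts.
    Returns the number of training rounds used."""
    for attempt in range(1, max_rounds + 1):
        train(message, actual)              # learn as the correct class
        if classify(message) == actual:     # stop as soon as it sticks
            return attempt
    return max_rounds                       # gave up after max_rounds


# Stand-in classifier that needs 3 trainings before the class "sticks":
state = {"trained": 0}
train = lambda msg, cls: state.__setitem__("trained", state["trained"] + 1)
classify = lambda msg: "spam" if state["trained"] >= 3 else "innocent"

print(force_train("msg-001", "spam", classify, train))   # 3
```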
The setup you are planning to build: is it going to be just for you, or for a bunch of other people too? If so: have you considered using groups in DSPAM? That would help your other users reach a certain accuracy faster than without groups.

> --
>
> Best regards,
> Cheers,
> Josef Liška

Steve

------------------------------------------------------------------------------
_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
