-------- Original Message --------
> Date: Fri, 31 Jul 2009 09:21:06 +0200
> From: Sebastian Toepfer <[email protected]>
> To: [email protected]
> Subject: Re: [Dspam-user] Upgrade dspam 3.6.8 to 3.9.0-git

> Hello Steve,
>
> On Fri, 31 Jul 2009 01:09:18 +0200, "Steve" <[email protected]> wrote:
> > -------- Original Message --------
> >> Date: Thu, 30 Jul 2009 19:01:37 +0200
> >> From: "Sebastian Toepfer" <[email protected]>
> >> To: [email protected]
> >> Subject: Re: [Dspam-user] Upgrade dspam 3.6.8 to 3.9.0-git
> >
> >> Hello Steve,
> >>
> > Hello Sebastian
> >
> >> thanks, my holiday is rescued :)
> >>
> > Why? What did I write that was good enough to rescue your holiday?
> >
> Everything - now I can do the migration and still have something left of
> my holiday :)
>
Aha. Now I understand. Somehow I seem to be in SLL ("Super-Lange-Leitung", i.e. very slow on the uptake) mode.

> [...]
> >> >> to change the tokenizer without retraining for the
> >> >> users. Because I use dspam at home and the "users" have trained dspam
> >> >> for about 3 years and they will kill me if they must do this again :(
> >> >>
> >> > If I understand that right, you are asking if you could shorten the
> >> > training for the new installation by using old data. Right? Yes! You
> >> > can do that. You could dump or copy the old data and import it on the
> >> > new installation. But if I see that right, then you are planning to
> >> > change the tokenizer, and changing the tokenizer mostly means that the
> >> > old data is useless.
> >> >
> >> bad news ... I've read that other tokenizers are better,
> >>
> > Better in what? If it were so clear which tokenizer is the best, then we
> > would probably remove all the others. But it's not that easy. For some
> > setups tokenizer A is better than tokenizer B and so on...
> >
> No, I don't know which is the best. But I found some references
> recommending OSB or SBPH.
>
If you are not using the Hash driver, then SBPH is out of the question for you.

> >> why is it not
> >> possible to migrate the data from one tokenizer to another? Is it a
> >> problem with how dspam creates these tokens - is it only one way?
> >>
> > Yep.
> > The reason is very simple:
> > 1) Not all tokenizers use the same schema/pattern
> > 2) There is no chain information saved inside the token
> > 3) Computing from normal text to token is easy, but the way back is hard
> >
> > I am now going to explain in depth how the tokenizers create the
> > tokens/patterns. I do that because I hope new users will search the
> > mailing list archives and stop asking the same question over and over. I
> > will just show the token-generating part. Internally DSPAM uses algorithms
> > for calculating the probability and the confidence factor. I am not going
> > to explain the latter two parts. Just the token creation. Besides the token
> > creation, DSPAM puts different weights on the generated tokens depending
> > on which tokenizer is used. I am not going to explain that either. I have
> > done that already in the past, and the info about the weight of the tokens
> > inside the tokenizers is explained there. If you need that info, then please
> > search the mailing list and read more about it there.
> >
> > So now the technical mumbo jumbo. Let me explain:
> > --------------------------------------------------
> [...]
> Thanks for this explanation. It's all clear now.
>
Perfect. My goal is reached.

> >> > 3 years of data is all fine and okay, but to be honest you will not
> >> > lose much. Just the first days will lead to more training, but after a
> >> > short time DSPAM will catch up and be very accurate.
> >> >
> >> It's a small installation, only ca. 30,000 mails in these 3 years ... and
> >> 20,000 of them are mine :) .. so I think it will take a year to reach the
> >> current accuracy.
> >>
> > No way. A year? NEVER! Expect a bunch of corrections (in the 2-digit range)
> > and you would already be easily above 90% or even 95%. Just take something
> > like OSB or CHAIN. Don't go with WORD in your case.
> >
> See my question about the tokenizer; I'll switch it, and after this answer
> I'll do it and remove all training data.
>
Correct. Start fresh. It will hurt, but not that much. Just some days of retraining and then correcting some errors from time to time, and that's it. But in no way do you need to spend the next year just doing corrections.

> >> Or what do you think, how long will it take with this low volume?
> >> E.g. one user has only 700 ham but 1500 spam (accuracy 91.40% - she
> >> loves dspam :)).
> >>
> > Not much time. Really. And you could still pretrain a merged or
> > shared,merged group and speed up the process. You can find SPAM corpora
> > everywhere on the net (they are (almost) as plentiful as sand by the sea).
> >
> But which is the best for German users, where one user receives English
> newsletters/mailing lists?
>
There is no "best for German" or "best for language XYZ". They are all good. I mean all the tokenizers are good.

> Everything I've tested resulted in bad accuracy. I hate
> false negatives; they are the worst thing a spam filter can do,
>
False negatives being the worst thing? In my experience users hate false positives (a HAM message tagged/classified as SPAM) much more. In DSPAM you have a lot of possibilities to tweak the filter. If you hate false negatives so much, then set the training buffer to 0 and DSPAM will tag messages as SPAM more aggressively (with the negative consequence of probably tagging more legitimate messages (aka HAM) as SPAM too).

> see gmx .. you must
> check all your spam daily to find the newsletters :(.
>
I am more or less in your situation. Most mail is German, while newsletters are English and communication with some vendors is in English too. After a bunch of corrections my DSPAM tags the messages (both the German ones and the English newsletters) very accurately.

> If false negatives are at
> a low level, then the user checks the quarantine once a week/month and
> all is okay.
>
If you are using the purge script for MySQL (you use MySQL. Right?)
then once a week or biweekly is okay, but anything longer than that is not okay, since signatures will be removed if they are older than 14/15 days.

> >> >> any other pitfalls?
> >> >>
> >> > Not really.
> >> >
> >> Very good news.
> >>
> > :)
> >
> >> >> I use dspam with mysql as backend and without groups.
> >> >>
> >> > If you have many users, then using groups could help to shorten
> >> > training time.
> >> >
> >> Only 5 users with very different mails. My old solution was a single-user
> >> spam filter, which resulted in very, very bad accuracy. I found dspam and
> >> was surprised how well it works (200 or 300 mails and it rocks)! The
> >> learning by forwarding was another big hit, because we use POP3, and how
> >> should we train a filter which runs on a gateway?
> >>
> > Either with the DSPAM Web UI or directly from within the email client (we
> > have plugins for Mozilla Thunderbird, Lotus Notes and Microsoft Outlook
> > (and possibly others. Just ask here and I am sure someone has made
> > something you could reuse)).
> >
> Wasn't a real question (I told you that I can't write English
You can write it, but I don't always understand it. I mean, I don't understand what you want to say.
> :(). More a feature, a reason to use dspam, because the ways to do this work
> out of the box :)
> Okay, setting up the web GUI is more of a ..., but I have seen that a
> replacement is in the works/planned.
>
Please write what is bad about the current Web UI. The replacement could profit from your feedback.

> >> Sebastian
> >>
> > Steve
> thanks again,
> Sebastian
>
Steve

> ps: is it possible to set the reply-to on this mailing list to
> [email protected]? I click only "answer" and then only the one
> who wrote the mail is in the To field :(
>
Something Alexander or Paul needs to do.
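
A minimal sketch of why training data cannot be carried over from one tokenizer to another, as discussed above. It assumes that only a hash of each token is stored, not the token text. The SHA-1 based hash_token, the "+" joiner and both tokenizer functions are hypothetical stand-ins for illustration, not DSPAM's actual code, and the chain variant is simplified to single words plus adjacent word pairs.

```python
import hashlib


def hash_token(text: str) -> int:
    """Stand-in 64-bit token hash (assumption: not DSPAM's real hash function)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def word_tokens(message: str) -> set:
    """WORD-style sketch: one hashed token per single word."""
    words = message.lower().split()
    return {hash_token(w) for w in words}


def chain_tokens(message: str) -> set:
    """CHAIN-style sketch: single words plus adjacent word pairs, all hashed."""
    words = message.lower().split()
    singles = {hash_token(w) for w in words}
    # The "+" joiner is an arbitrary choice for this illustration.
    pairs = {hash_token(a + "+" + b) for a, b in zip(words, words[1:])}
    return singles | pairs


message = "buy cheap pills now"
word_db = word_tokens(message)    # what a WORD-trained database would hold
chain_db = chain_tokens(message)  # what a CHAIN-trained database would hold

# Hashes that exist only in the chain database. They cannot be computed
# from word_db, because the words behind the stored hashes are not kept.
missing = chain_db - word_db
print(len(word_db), len(chain_db), len(missing))
```

Under these assumptions the pair tokens can only be produced from the original message text, which is why switching the tokenizer means retraining instead of converting the old database.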
_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
