-------- Original Message --------
> Date: Wed, 26 Aug 2009 12:00:59 +0200
> From: Sven Karlsson <[email protected]>
> To: [email protected]
> Subject: Re: [Dspam-user] high level of missed ham, but all factors at 0.01000

> On Tue, Aug 25, 2009 at 10:27 PM, Steve<[email protected]> wrote:
> >
> > -------- Original Message --------
> >> Date: Tue, 25 Aug 2009 21:33:19 +0200
> >> From: Sven Karlsson <[email protected]>
> >> To: [email protected]
> >> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000
> >
> >> X-NS-Message-Id*BD74, 0.01000
> >>
> > Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely not
> > DSPAM 3.9.0. Right?
>
> No, 3.6.8 as you noted below. But isn't it strange that the most
> significant tokens are at 0.01, and it is still considered spam?
>
Many calculations feed into the classification. It is not as simple as assuming that a token at 0.01 should carry less weight. However, if you see a lot of 0.01 values, that means all the other tokens of the mail in question rank lower and are not significant. If you look inside the 3.6.8 code you will see that DSPAM uses the 27 most significant tokens for Burton, 25 for Robinson and 15 for all other Bayesian classifiers (Markov has its own count). So in your case the 27 most significant tokens mostly sit at 0.01. Either most of your mails contain tokens which DSPAM has never seen before, or you have too many tokens in your DSPAM data that are seen too frequently in all mails regardless of their class.

> >> Other strangeness: most factors displayed seem to be from the header,
> >> such as month*day pairs (although not in this example). I would assume
> >> that the email content would account for better indication of
> >> ham/spam.
> >>
> > That is certainly true, but you probably use one of the Bayesian algorithms
> > and they only use the most significant tokens (15 tokens and up, but not
> > endlessly many). If you want all tokens to be considered then you should
> > use naïve, as this would process all tokens.
> >
> Ok.
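
A rough illustration of what "most significant tokens" means in the reply above. This is only a minimal sketch and not DSPAM's actual code: it assumes that significance is measured as the distance of a token's probability from the neutral value 0.5 (the Graham-style measure), and the input format (one whitespace-free token and its probability per line, separated by a space) as well as the function name top_tokens are made up for the example.

```shell
# top_tokens N FILE
# Print the N tokens from FILE whose probability is farthest from 0.5.
# FILE format (hypothetical): "<token> <probability>" per line, tokens
# must not contain spaces.
top_tokens() {
    n="$1"
    file="$2"
    # Prefix every line with |p - 0.5|, sort on that prefix (largest first),
    # keep the first N lines and strip the prefix again.
    LC_ALL=C awk '{ d = $2 - 0.5; if (d < 0) d = -d; printf "%.5f %s %s\n", d, $1, $2 }' "$file" |
        LC_ALL=C sort -k1,1nr |
        head -n "$n" |
        cut -d' ' -f2,3
}
```

Called as "top_tokens 27 factors.txt" it would list the 27 tokens a Burton-style classifier would look at under that assumption. A token at 0.01 is about as far from 0.5 as a token can get, which is why such tokens fill the list even when the final verdict is spam.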
>
> >> Even more strangeness: The "improbability drive" shows "1 in 151
> >> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
> >> cases (of 2146 examined emails). I would expect a lot more variation
> >> here. Does this indicate a problem?
> >>
> > YES! Something is not right with the statistical counters. Is that issue
> > only on your setup or do you have other users having the same issue?
>
> This was for all users.
>
That is not good. I would suggest that you FORCE a training run with the SPAM/HAM corpus you have over there (the one with 5000 mails), but don't use TOE. Use TEFT and force DSPAM to learn all of them. It might be even better to inoculate them. For example like this (assuming you have a directory ham and a directory spam and your global user is called "globaluser"):

  for foo in ./ham/* ; do dspam --user globaluser --class=innocent --source=inoculation --deliver=summary --stdout < "${foo}" ; done
  for foo in ./spam/* ; do dspam --user globaluser --class=spam --source=inoculation --deliver=summary --stdout < "${foo}" ; done

> >> The setup scenario is for about 1000 mailboxes, using a global user,
> >> TOE training and initial corpus of about 5000 manually sorted
> >> spam/ham. There is a central periodic TOE training done about once a
> >> week for a sample of all messages, training the globaluser.
> >>
> > I don't understand this. What are you training once a week? A new and
> > fresh set of HAM/SPAM or the same manually sorted 5000 HAM/SPAM messages?
> >
> New email; one admin goes through a global mailbox and retrains the
> obvious missed spam and hams. This means that not all FP/FN are
> retrained, but it should be OK since it's TOE training (even though
> some accuracy is lost). It also means that training may be focused on
> for example certain days of the week (the admin doing the training is
> more alert when starting at the Monday emails, but may stop training
> at Wednesday emails, leaving Thursday-Sunday untrained). This may give
> an unfair balance I assume.
>
And how is that global mailbox connected or related to the other users? Is that global mailbox the mailbox of your global user? Does retraining there benefit other users, or is it just for one user?

> >> Algorithm graham burton
> >>
> > AHA! So there we are. That's the reason for the reduced number of tokens
> > in the show factors output. This is btw nothing bad. It's not necessarily
> > needed to process all tokens to get a good result.
>
> Ok.
>
> >> PValue graham
> >>
> > Uhh... if you have that in PValue then this must be DSPAM 3.6.8 or less.
> > Am I right?
> >
> >> libmysql_drv storage driver
> >>
> >> Using dspam 3.6.8 shipped with Debian.
> >>
> > Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your
> > DSPAM setup? 3.8.0 at least. DSPAM 3.6.8 does not offer you much to improve
> > the situation you are currently facing.
>
> Can 3.8.0 be used in production?
>
YES! While 3.8.0 might not be as polished as 3.9.0, it is still better than 3.6.8. A few months ago 3.8.0 was the best there was, and had we not started to work on 3.9.0, no one here would know everything that does not work or has issues in 3.8.0. Some distros had started to patch 3.8.0 and fix issues, and some distros added new functionality, but no one really had the big picture of what does not work in 3.8.0. However, as far as I can remember, most issues with 3.8.0 were already present in 3.6.8. The only thing I remember that is new in 3.8.0 and was not there in 3.6.8 is one issue with how spam/ham aliases for retraining are handled. But that's it. All other issues present in 3.8.0 were already present in 3.6.8. So I would say that you are safe to go with 3.8.0.

> I was thinking of moving directly to
> 3.9.0, but I'm unsure about the stability....
>
Look! I am not the right person to judge the stability of 3.9.0. I am too involved in 3.9.0 to have an objective viewpoint.
While I think that I have an objective viewpoint, I know that I can't have one, since much of the code comes from me and that voids my objectivity. That said, I can assure you that the stability of 3.9.0 is better than the stability of 3.6.8. 99.9% of all software will never be 100% bug free and DSPAM is no exception. However, a lot of bug fixing has gone into 3.9.0 to increase the stability of DSPAM. If you look inside our bug tracker you will see many tickets about DSPAM crashing, DSPAM not being stable, etc., and you will see that all of those issues have been addressed and fixed, and that the original reporters are not reopening the tickets or opening new tickets for the original issue. This is surely an indicator that stability in DSPAM is increasing and getting better with each commit to our repository.

If I were in your situation, I would download DSPAM 3.9.0 BETA, install it on a test system, train it with those 5000 mails you have there, and then take my mailbox, push all mails through 3.9.0 and look how well it scores. If the result satisfied me, I would go on and migrate my account from production to the 3.9.0 test system. I would look at where I have issues during that test migration and at how I can fix them. If that went well, I would go on and try to migrate 1/10 of all mailboxes, look at what other issues come up during that test migration and again look at how to solve them. After all those tasks I would have all the information needed to sit down and actually plan the whole migration of my production 3.6.8 system. I would have enough numbers to know how long the production downtime will be while moving to 3.9.0 (or 3.8.0, depending on which version you tested), I would know what issues I could possibly face and how to solve them, I would know whether I can reuse my old tokens or should start from zero, etc...
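
To put a number on "how well it scores" during such a test run, something along these lines could do. It is only a minimal sketch and not part of DSPAM: it assumes you have written one line per test message to a file in the form "<expected> <actual>", each being either innocent or spam. How you produce that file is up to you, and both the file format and the function name score_results are made up for the example.

```shell
# score_results FILE
# Tally a test run. FILE format (hypothetical): one message per line,
# "<expected> <actual>", each value being "innocent" or "spam".
score_results() {
    awk '
        { total++ }
        # false positive: ham that ended up in the spam box
        $1 == "innocent" && $2 == "spam"     { fp++ }
        # false negative: spam that was delivered as ham
        $1 == "spam"     && $2 == "innocent" { fn++ }
        END { printf "total=%d fp=%d fn=%d\n", total, fp, fn }
    ' "$1"
}
```

Running it once against the 3.6.8 results and once against the test system gives two lines that can be compared directly, which is the kind of number you would want before planning the real migration.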

> Users are already
> calling and complaining about ham ending up in the spamboxes :)
>
Now it just depends on how much pressure those users put on you and how much a happy customer is worth to you compared to the work needed to migrate your system to 3.8.0 or 3.9.0.

> > Beside the 3.6.8 version of DSPAM? Not much (if at all). From what I see
> > above you can't much improve your situation with 3.6.8.
> >
> >> Any way to debug the factors/tokens?
> >>
> > Debug in what way?
>
> Such as why tokens with 0.01 probability end up as spam (or maybe I
> don't understand this correctly, but I've seen the v*gra tokens having
> like 0.96 probability, which is more understandable..).
>
Aha. It's pure mathematics. The algorithms used inside DSPAM are responsible for computing the probability. Running DSPAM in debug mode will show you the end result, but it will not show the fully expanded computation of the probability. You will just see that token "ABC" has probability 0.01, but you will not see the computation saying P(A|B) = P(B|A) * P(A) / P(B) = 0.01.

> Maybe there is some problem with the global group/user handling? (i.e.
> users are normally not training themselves.)
> Should retraining be done with dspam --user globaluser or no user
> setting at all? (only using the uid in the signature).
>
YES! Retraining should be done with the user having the issue. Retraining the global user with a signature (and uid) that does not belong to it is useless! If you have enabled retraining with uid, then it does not matter which user you retrain with (as long as you allow that user to do the retraining, i.e. you trust that user to do the retraining for other users), since internally DSPAM will retrain the message for the proper user anyway.

> I have also tried to first do a reclassification with source=error,
> and also tried retraining them instead as corpus, after removing the
> previous dspam header and signature data. Maybe this has a negative
> impact on the statistics?
>
You cannot and should not use source=error with a signature that does not belong to the user you are training (unless you have enabled uid-based retraining). You can, however, use source=corpus after removing the DSPAM headers and then train your global user, but that does not have the huge effect you expect. It is better to retrain the message with the user that received it in the first place and to use source=corpus only for the global user.

> BR,
> Sven
>
Steve

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
