Hello Dov! I have missed you on the ML for a long time. Nice to see you back :)

-------- Original Message --------
> Date: Wed, 26 Aug 2009 13:18:12 +0300
> From: Dov Zamir <[email protected]>
> To: [email protected]
> Subject: Re: [Dspam-user] high level of missed ham, but all factors at 0.01000

> On 08/26/2009 01:00 PM, Sven Karlsson wrote:
> > On Tue, Aug 25, 2009 at 10:27 PM, Steve<[email protected]> wrote:
> >>
> >> -------- Original Message --------
> >>> Date: Tue, 25 Aug 2009 21:33:19 +0200
> >>> From: Sven Karlsson<[email protected]>
> >>> To: [email protected]
> >>> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000
> >
> >>> X-NS-Message-Id*BD74, 0.01000
> >>>
> >> Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely
> >> not DSPAM 3.9.0. Right?
> >
> > No, 3.6.8 as you noted below. But isn't it strange that the most
> > significant tokens are at 0.01, and it is still considered spam?
> >
> >>> Other strangeness: most factors displayed seem to be from the header,
> >>> such as month*day pairs (although not in this example). I would assume
> >>> that the email content would give a better indication of
> >>> ham/spam.
> >>>
> >> That is surely true, but you probably use one of the Bayesian
> >> algorithms, and they only use the most significant tokens (15 tokens
> >> and up, but not endlessly up). If you want all tokens to be
> >> considered, then you should use naïve, as this would process all
> >> tokens.
> >
> > Ok.
> >
> >>> Even more strangeness: The "improbability drive" shows "1 in 151
> >>> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
> >>> cases (of 2146 examined emails). I would expect a lot more variation
> >>> here. Does this indicate a problem?
> >>>
> >> YES! Something is not right with the statistical counters. Is that
> >> issue only on your setup, or do you have other users having the same
> >> issue?
> >
> > This was for all users.
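
For reference, here is a minimal sketch of how a Graham-style combiner treats the factors discussed above. It follows the combining rule from Paul Graham's "A Plan for Spam" and is not DSPAM's actual code. The function name `graham_combine` and the fixed window of 15 tokens are assumptions made for illustration, and the token probabilities are assumed to be already clamped to the 0.01 to 0.99 range.

```python
from math import prod


def graham_combine(token_probs, max_tokens=15):
    """Combine per-token spam probabilities, Graham style.

    Hypothetical helper for illustration only, not part of DSPAM.
    Assumes every probability is strictly between 0 and 1.
    """
    # Keep only the most "interesting" tokens: those furthest from 0.5.
    significant = sorted(
        token_probs, key=lambda p: abs(p - 0.5), reverse=True
    )[:max_tokens]

    spam = prod(significant)                  # evidence for spam
    ham = prod(1.0 - p for p in significant)  # evidence for ham
    return spam / (spam + ham)


# Fifteen strongly "hammy" factors, as in the show-factors output above.
all_ham = graham_combine([0.01] * 15)

# Neutral tokens do not change the result: they never make the top 15.
with_noise = graham_combine([0.01] * 15 + [0.5] * 100)

# Fifteen strongly "spammy" factors for comparison.
all_spam = graham_combine([0.99] * 15)
```

Under this rule, fifteen factors at 0.01 combine to a value extremely close to 0, that is, clearly ham. If a message with such factors is still tagged as spam, the combining step alone would not explain it, which would fit the suspicion that the statistical counters are off.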
> >
> >>
> >>> The setup scenario is for about 1000 mailboxes, using a global user,
> >>> TOE training and an initial corpus of about 5000 manually sorted
> >>> spam/ham. There is a central periodic TOE training done about once a
> >>> week for a sample of all messages, training the globaluser.
> >>>
> >> I don't understand this. What are you training once a week? A new and
> >> fresh set of HAM/SPAM, or the same manually sorted 5000 HAM/SPAM
> >> messages?
> >
> > New email; one admin goes through a global mailbox and retrains the
> > obvious missed spam and hams. This means that not all FP/FN are
> > retrained, but it should be OK since it's TOE training (even though
> > some accuracy is lost). It also means that training may be focused on,
> > for example, certain days of the week (the admin doing the training is
> > more alert when starting at the Monday emails, but may stop training
> > at the Wednesday emails, leaving Thursday-Sunday untrained). This may
> > give an unfair balance, I assume.
>
> wow! I'm really surprised that dspam is doing as well as this, given the
> sporadic training you're doing. How can you expect dspam to know what is
> spam and what isn't spam/ham if some of the errors go without
> retraining? Sounds to me that you are really confusing the engine, since
> an untrained error contributes to further errors.
>

Ahhh... it's not that easy. He is using TOE, and an untrained error does
not make things worse. It does not help to fix the issue, but it does not
make the engine classify worse either. It weakens the classification
very, very slightly, since the statistical numbers still get updated, but
the effect is nowhere near as strong as it would be with TEFT.

> >>
> >>> Algorithm graham burton
> >>>
> >> AHA! So there we are. That's the reason for the reduced number of
> >> tokens in the show factors output. This is btw nothing bad. It's not
> >> necessarily needed to process all tokens to get a good result.
> >
> > Ok.
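
The TOE versus TEFT point made above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how DSPAM stores or updates its data: the class `ToyFilter` and all of its methods are hypothetical, and it tracks only per-token counts, ignoring the small statistical updates that, as noted above, still happen under TOE.

```python
from collections import defaultdict


class ToyFilter:
    """Toy model of two training modes (hypothetical, not DSPAM code)."""

    def __init__(self, mode):
        self.mode = mode  # "TOE" or "TEFT"
        self.counts = defaultdict(lambda: {"spam": 0, "ham": 0})

    def learn(self, tokens, label):
        for tok in tokens:
            self.counts[tok][label] += 1

    def unlearn(self, tokens, label):
        for tok in tokens:
            self.counts[tok][label] = max(0, self.counts[tok][label] - 1)

    def deliver(self, tokens, verdict):
        # TEFT trains on its own verdict for every message.
        # TOE leaves the token counts alone until an error is reported.
        if self.mode == "TEFT":
            self.learn(tokens, verdict)

    def report_error(self, tokens, wrong_verdict):
        correct = "ham" if wrong_verdict == "spam" else "spam"
        if self.mode == "TEFT":
            # Undo what was learned from the wrong verdict first.
            self.unlearn(tokens, wrong_verdict)
        self.learn(tokens, correct)


# A ham message that gets misclassified as spam and is never retrained.
msg = ["invoice", "meeting"]
toe, teft = ToyFilter("TOE"), ToyFilter("TEFT")
toe.deliver(msg, "spam")
teft.deliver(msg, "spam")

toe_tokens_touched = len(toe.counts)             # TOE stored nothing
teft_spam_hits = teft.counts["invoice"]["spam"]  # TEFT learned the mistake
```

In this model the uncorrected false positive leaves the TOE token data untouched, while under TEFT the ham tokens now carry a spam hit that stays until someone calls `report_error`.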
> >
> >>
> >>> PValue graham
> >>>
> >> Uhh... if you have that in PValue, then this must be DSPAM 3.6.8 or
> >> less. Am I right?
> >>
> >>> libmysql drv storage driver
> >>>
> >>> Using dspam 3.6.8 shipped with Debian.
> >>>
> >> Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your
> >> DSPAM setup? 3.8.0 at least. DSPAM 3.6.8 does not offer you much to
> >> improve the situation you are currently facing.
> >
> > Can 3.8.0 be used in production? I was thinking of moving directly to
> > 3.9.0, but I'm unsure about the stability.... Users are already
> > calling and complaining about ham ending up in the spamboxes :)
> >
> >> Besides the 3.6.8 version of DSPAM? Not much (if anything at all).
> >> From what I see above, you can't improve your situation much with
> >> 3.6.8.
> >>
> >>> Any way to debug the factors/tokens?
> >>>
> >> Debug in what way?
> >
> > Such as why tokens with 0.01 probability end up as spam (or maybe I
> > don't understand this correctly, but I've seen the v*gra tokens having
> > like 0.96 probability, which is more understandable..).
> >
> > Maybe there is some problem with the global group/user handling? (i.e.
> > users are normally not training themselves.)
> > Should retraining be done with dspam --user globaluser or no user
> > setting at all? (only using the uid in the signature).
> >
> > I have also tried to first do a reclassification with source=error,
> > and also tried retraining them instead as corpus, after removing the
> > previous dspam header and signature data. Maybe this has a negative
> > impact on the statistics?
> >
> > BR,
> > Sven
> >
> > ------------------------------------------------------------------------------
> > Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
> > trial. Simplify your report design, integration and deployment - and focus on
> > what you do best, core application coding. Discover what's new with
> > Crystal Reports now.
> > http://p.sf.net/sfu/bobj-july
> >
> > Dspam-user mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dspam-user
> >
> > !DSPAM:500,4a950a2d260342283011438!
