On 08/26/2009 01:00 PM, Sven Karlsson wrote:
> On Tue, Aug 25, 2009 at 10:27 PM, Steve<[email protected]> wrote:
>>
>> -------- Original Message --------
>>> Date: Tue, 25 Aug 2009 21:33:19 +0200
>>> From: Sven Karlsson<[email protected]>
>>> To: [email protected]
>>> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000
>
>>> X-NS-Message-Id*BD74, 0.01000
>>>
>> Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely not DSPAM
>> 3.9.0. Right?
>
> No, 3.6.8 as you noted below. But isn't it strange that the most
> significant tokens are at 0.01, and it is still considered spam?
>
>>> Other strangeness: most factors displayed seem to be from the header,
>>> such as month*day pairs (although not in this example). I would assume
>>> that the email content would account for better indication of
>>> ham/spam.
>>>
>> That is surely true but you probably use one of the Bayesian algorithms and
>> they only use the most significant tokens (15 tokens and up but not endless
>> up). If you want all tokens to be considered then you should use naïve as
>> this would process all tokens.
>
>
> Ok.
>
>>> Even more strangeness: The "improbability drive" shows "1 in 151
>>> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
>>> cases (of 2146 examined emails). I would expect a lot more variation
>>> here. Does this indicate a problem?
>>>
>> YES! Something is not right with the statistical counters. Is that issue
>> only on your setup or do you have other users having the same issue?
>
> This was for all users.
>
>>
>>
>>> The setup scenario is for about 1000 mailboxes, using a global user,
>>> TOE training and initial corpus of about 5000 manually sorted
>>> spam/ham. There is a central periodic TOE training done about once a
>>> week for a sample of all messages, training the globaluser.
>>>
>> I don't understand this. What are you training once a week? New and fresh
>> set of HAM/SPAM or the same manually sorted 5000 HAM/SPAM messages?
>
>
> New email; one admin goes through a global mailbox and retrains the
> obvious missed spam and hams. This means that not all FP/FN are
> retrained, but it should be OK since it's TOE training (even though
> some accuracy is lost). It also means that training may be focused on,
> for example, certain days of the week (the admin doing the training is
> more alert when starting at the Monday emails, but may stop training
> at Wednesday emails, leaving Thursday-Sunday untrained). This may give
> an unfair balance, I assume.

Wow! I'm really surprised that dspam is doing as well as this, given the
sporadic training you're doing. How can you expect dspam to know what is
spam and what is ham if some of the errors go without retraining? It sounds
to me like you are really confusing the engine, since an untrained error
contributes to further errors.

>
>>
>>
>>> Algorithm graham burton
>>>
>> AHA! So there we are. That's the reason for the reduced amount of tokens on
>> the show factors output. This is btw nothing bad. It's not necessarily
>> needed to process all tokens to get a good result.
>
> Ok.
>
>>
>>
>>> PValue graham
>>>
>> Uhh... if you have that in PValue then this must be DSPAM 3.6.8 or less. Am
>> I right?
>>
>>
>>> libmysql drv storage driver
>>>
>>> Using dspam 3.6.8 shipped with Debian.
>>>
>> Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your DSPAM
>> setup? 3.8.0 at least. DSPAM 3.6.8 does not offer you much to improve the
>> situation you are currently facing.
>
> Can 3.8.0 be used in production? I was thinking of moving directly to
> 3.9.0, but I'm unsure about the stability.... Users are already
> calling and complaining about ham ending up in the spamboxes :)
>
>> Beside the 3.6.8 version of DSPAM? Not much (if at all). From what I see
>> above you can't much improve your situation with 3.6.8.
>>
>>
>>> Any way to debug the factors/tokens?
>>>
>> Debug in what way?
>
> Such as why tokens with 0.01 probability end up as spam (or maybe I
> don't understand this correctly, but I've seen the v*gra tokens having
> like 0.96 probability, which is more understandable..).
>
>
> Maybe there is some problem with the global group/user handling? (i.e.
> users are normally not training themselves.)
> Should retraining be done with dspam --user globaluser or no user
> setting at all? (only using the uid in the signature).
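
On the 0.01 question: with a Graham-style combination, only the tokens
farthest from 0.5 are used, and a set dominated by 0.01 tokens should come
out very close to 0 (ham), not spam. Below is a minimal Python sketch of
that arithmetic. It is not DSPAM's actual code: the function name, the
window of 15 and the 0.01/0.99 clamp are assumptions made for illustration,
and 3.6.8 may differ in all of them.

```python
from math import prod


def graham_combine(token_probs, window=15, floor=0.01, ceil=0.99):
    """Combine per-token spam probabilities with Graham's formula.

    Hypothetical helper for illustration only. It keeps the `window`
    tokens whose probability is farthest from the neutral value 0.5 and
    returns prod(p) / (prod(p) + prod(1 - p)) over those tokens.
    """
    if not token_probs:
        return 0.5  # no evidence either way

    # Clamp so no single token can force the result to exactly 0 or 1.
    clamped = [min(max(p, floor), ceil) for p in token_probs]

    # Most "interesting" tokens first: largest distance from 0.5.
    interesting = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:window]

    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Fifteen strongly hammy tokens swamp any number of mildly spammy ones.
    print(graham_combine([0.01] * 15 + [0.60] * 30))
    # Tokens around 0.96 push the result towards spam.
    print(graham_combine([0.96] * 15))
```

If the factors shown really are all at 0.01 and the message is still tagged
as spam, then the displayed factors may not be the ones that drove the
verdict, or the counters behind them may be off, which would fit Steve's
remark about the statistical counters.
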
> I have also tried to first do a reclassification with source=error,
> and also tried retraining them instead as corpus, after removing the
> previous dspam header and signature data. Maybe this has a negative
> impact on the statistics?
>
> BR,
> Sven
>
> Dspam-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspam-user
