Re: [Dspam-user] high level of missed ham, but all factors at 0.01000

Steve Wed, 26 Aug 2009 04:23:07 -0700

-------- Original Message --------
> Date: Wed, 26 Aug 2009 12:00:59 +0200
> From: Sven Karlsson <[email protected]>
> To: [email protected]
> Subject: Re: [Dspam-user] high level of missed ham, but all factors at 0.01000


> On Tue, Aug 25, 2009 at 10:27 PM, Steve<[email protected]> wrote:
> >
> > -------- Original Message --------
> >> Date: Tue, 25 Aug 2009 21:33:19 +0200
> >> From: Sven Karlsson <[email protected]>
> >> To: [email protected]
> >> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000
> 
> >>         X-NS-Message-Id*BD74, 0.01000
> >>
> > Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely not
> > DSPAM 3.9.0. Right?
> 
> No, 3.6.8 as you noted below. But isn't it strange that the most
> significant tokens are at 0.01, and it is still considered spam?
> 
Many calculations go into the classification. It is not as simple as saying 
that a token at 0.01 should carry less weight. However... if you see a lot of 
0.01 values, it means that all the other tokens in that mail are less 
significant than those. If you look inside the 3.6.8 code you will see that 
DSPAM uses the 27 most significant tokens for Burton, 25 for Robinson and 15 
for all other Bayesian classifiers (Markov has its own count). So in your case 
most of the 27 most significant tokens are at 0.01. Either most of your mails 
contain tokens which DSPAM has never seen before, or you have too many tokens 
in your DSPAM data that are seen too frequently in all mails regardless of 
their class.
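
To make "most significant tokens" concrete, here is a minimal sketch. It assumes that significance means the distance of a token's probability from the neutral value 0.5, and that the factors are available as one "token probability" pair per line. The function name top_tokens and the input format are made up for illustration; this is not DSPAM's own code.

```shell
# Hypothetical helper: keep the N tokens whose probability lies
# farthest from 0.5 (assumed definition of "most significant").
# Input on stdin: one "token probability" pair per line.
top_tokens() {
    n="$1"
    # Prefix each line with its distance from 0.5 ...
    awk '{ d = $2 - 0.5; if (d < 0) d = -d; printf "%.5f %s %s\n", d, $1, $2 }' |
        # ... sort by that distance, largest first ...
        LC_ALL=C sort -k1,1nr |
        # ... keep the first N lines and drop the helper column again.
        head -n "$n" |
        awk '{ print $2, $3 }'
}
```

Called as `top_tokens 27 < factors.txt` (factors.txt being a placeholder file name), this would keep the 27 strongest factors, which is the count mentioned above for Burton.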


> >> Other strangeness: most factors displayed seem to be from the header,
> >> such as month*day pairs (although not in this example). I would assume
> >> that the email content would account for better indication of
> >> ham/spam.
> >>
> > That is surely true, but you probably use one of the Bayesian algorithms
> > and they only use the most significant tokens (15 tokens and up, but not
> > endlessly up). If you want all tokens to be considered then you should use
> > naïve as this would process all tokens.
> 
> 
> Ok.
> 
> >> Even more strangeness: The "improbability drive" shows "1 in 151
> >> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
> >> cases (of 2146 examined emails). I would expect a lot more variation
> >> here. Does this indicate a problem?
> >>
> > YES! Something is not right with the statistical counters. Is that issue
> > only on your setup or do you have other users having the same issue?
> 
> This was for all users.
> 
That is not good. I would suggest that you FORCE a training with the SPAM/HAM 
corpus you have over there (the one with 5000 mails), but don't use TOE. Use 
TEFT and force DSPAM to learn all of them. Maybe even better would be to 
inoculate them. For example like this (assuming you have a directory ham and a 
directory spam and your global user is called "globaluser"):

for foo in ./ham/* ; do dspam --user globaluser --class=innocent \
  --source=inoculation --deliver=summary --stdout < "${foo}" ; done

for foo in ./spam/* ; do dspam --user globaluser --class=spam \
  --source=inoculation --deliver=summary --stdout < "${foo}" ; done


> >
> >
> >> The setup scenario is for about 1000 mailboxes, using a global user,
> >> TOE training and initial corpus of about 5000 manually sorted
> >> spam/ham. There is a central periodic TOE training done about once a
> >> week for a sample of all messages, training the globaluser.
> >>
> > I don't understand this. What are you training once a week? A new and
> > fresh set of HAM/SPAM or the same manually sorted 5000 HAM/SPAM messages?
> 
> 
> New email; one admin goes through a global mailbox and retrains the
> obvious missed spam and hams. This means that not all FP/FN are
> retrained, but it should be OK since it's TOE training (even though
> some accuracy is lost). It also means that training may be focused on
> for example certain days of the week (the admin doing the training is
> more alert when starting at the Monday emails, but may stop training
> at Wednesday emails, leaving Thursday-Sunday untrained). This may give
> an unfair balance I assume.
> 
And how is that global mailbox connected/related to other users? Is that global 
mailbox the mailbox of your global user? Does retraining there benefit other 
users or is it just for one user?


> >
> >
> >> Algorithm graham burton
> >>
> > AHA! So there we are. That's the reason for the reduced number of tokens
> > in the show factors output. This is btw nothing bad. It's not necessarily
> > needed to process all tokens to get a good result.
> 
> Ok.
> 
> >
> >
> >> PValue graham
> >>
> > Uhh... if you have that in PValue then this must be DSPAM 3.6.8 or less.
> > Am I right?
> >
> >
> >> libmysql_drv storage driver
> >>
> >> Using dspam 3.6.8 shipped with Debian.
> >>
> > Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your
> > DSPAM setup? 3.8.0 at least. DSPAM 3.6.8 does not offer you much to improve
> > the situation you are currently facing.
> 
> Can 3.8.0 be used in production?
>
YES! While 3.8.0 might not be as polished as 3.9.0, it is still better than 
3.6.8. Some months ago 3.8.0 was the top of the top, and had we not started to 
work on 3.9.0, no one here would know everything that does not work or has 
issues in 3.8.0. Some distros had started to patch 3.8.0 and fix issues, and 
some distros added new functionality, but no one really had the big picture of 
what does not work in 3.8.0. However... as far as I can remember most issues 
with 3.8.0 are issues which were already present in 3.6.8. The only thing I 
remember which is new in 3.8.0 and was not there in 3.6.8 is one issue with how 
spam/ham aliases for retraining are handled. But that's it. All other issues 
present in 3.8.0 were already present in 3.6.8. So I would say that you are 
safe to go with 3.8.0.


> I was thinking of moving directly to
> 3.9.0, but I'm unsure about the stability....
>
Look! I am not the right person to judge the stability of 3.9.0. I am too much 
involved in 3.9.0 to have an objective viewpoint. While I think that I have an 
objective viewpoint, I know that I can't have it, since much of the code is 
coming from me and this voids my objectivity.
That said, I can assure you that stability in 3.9.0 is better than the 
stability of 3.6.8. 99.9% of all software will never be 100% bug free and DSPAM 
is no exception. However... much bug fixing has been done on 3.9.0 to increase 
the stability of DSPAM. If you go and look inside our Bug Tracker you will see 
many tickets about DSPAM crashing, DSPAM not being stable, etc... and you will 
see that all of those issues have been addressed and fixed, and that the 
original reporter is not reopening the ticket or opening a new ticket for the 
original issue. This is surely an indicator that stability in DSPAM is 
increasing and getting better with each commit to our repository.

If I were in your situation then I would download DSPAM 3.9.0 BETA, install it 
on a test system, train it with those 5000 mails you have there, and then take 
my mailbox, push all mails through 3.9.0 and look at how well it scores. If 
the result satisfied me then I would go on and do a migration of my account 
from production to the 3.9.0 test system. I would look at where I have issues 
while doing that test migration and how I can fix them. If that went well then 
I would go on and try to migrate 1/10 of all mailboxes, look at what other 
issues I have while doing that test migration, and again look at how to solve 
them. After having done all those tasks I would have all the information 
needed to sit down and actually plan the whole migration of my production 
3.6.8 system. I would have enough numbers to know how long the downtime of my 
production system will be while moving to 3.9.0 (or 3.8.0, depending on which 
version you tested), I would know what issues I could possibly face and how to 
solve them, I would know if I can reuse my old tokens or if I should start 
from zero, etc...


> Users are already
> calling and complaining about ham ending up in the spamboxes :)
> 
Now it just depends on how much pressure those users put on you and how much a 
happy customer is worth to you compared to the work needed to migrate your 
system to 3.8.0 or 3.9.0.


> > Besides the 3.6.8 version of DSPAM? Not much (if at all). From what I see
> > above you can't improve your situation much with 3.6.8.
> >
> >
> >> Any way to debug the factors/tokens?
> >>
> > Debug in what way?
> 
> Such as why tokens with 0.01 probability end up as spam (or maybe I
> don't understand this correctly, but I've seen the v*gra tokens having
> like 0.96 probability, which is more understandable..).
> 
Aha. It's pure mathematics. The algorithms used inside DSPAM are responsible 
for the computation of the probability. Running DSPAM in debug mode will show 
you the end result, but it will not show the fully expanded computation of the 
probability. You will just see that token "ABC" has probability 0.01, but you 
will not see the computation saying P(A|B) = P(B|A) * P(A) / P(B) = 0.01.
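
For reference, this is the general textbook form of Bayes' theorem for a single token. The symbols are chosen here for illustration only: S stands for "the mail is spam", H for "the mail is ham" and W for "the token appears in the mail". It is not a reproduction of what a particular PValue setting in DSPAM computes, since each implementation adds its own weighting and limits on top.

```latex
P(S \mid W) = \frac{P(W \mid S)\, P(S)}{P(W \mid S)\, P(S) + P(W \mid H)\, P(H)}
```

A token shown with a value close to 0 has been seen almost only in ham, and a token close to 1 almost only in spam.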


> Maybe there is some problem with the global group/user handling? (i.e.
> users are normally not training themselves.)
> Should retraining be done with dspam --user globaluser  or no user
> setting at all? (only using the uid in the signature).
> 
YES! Retraining should be done with the user having the issue. Retraining the 
global user with a signature (and uid) not belonging to him is useless! If you 
have enabled retraining with uid then it does not matter which user you retrain 
with (as long as you allow that user to do the retraining, i.e. you trust that 
user to do the retraining for other users), since internally DSPAM will retrain 
the message for the proper user anyway.
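
As a rough sketch, retraining one misclassified message by its signature could look like this. The wrapper name retrain_by_signature is made up for illustration, and the user name and signature in the usage line are placeholders; the signature would normally be taken from the X-DSPAM-Signature header of the message in question.

```shell
# Hypothetical wrapper around the dspam command line client.
#   $1 = user that originally received the message
#   $2 = correct class, "innocent" or "spam"
#   $3 = signature taken from the message's X-DSPAM-Signature header
retrain_by_signature() {
    dspam --user "$1" --class="$2" --source=error --signature="$3"
}
```

For a ham that ended up in the spambox of a user called "someuser" this would be `retrain_by_signature someuser innocent SIGNATURE`, with SIGNATURE replaced by the real value.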


> I have also tried to first do a reclassification with source=error,
> and also tried retraining them instead as corpus, after removing the
> previous dspam header and signature data. Maybe this has a negative
> impact on the statistics?
> 
You can't and should not use source=error with a signature that does not belong 
to the user you are training (if you have not enabled uid based retraining). 
You can however use source=corpus after removing the DSPAM headers and then 
train your global user, but that will not have the huge effect you expect. 
It is better to retrain the message with the user that got the message in 
the first place and only use source=corpus for the global user.


> BR,
>  Sven
> 
Steve
