Re: [Dspam-user] high level of missed ham, but all factors at 0.01000

Dov Zamir Wed, 26 Aug 2009 03:33:09 -0700

On 08/26/2009 01:00 PM, Sven Karlsson wrote:
> On Tue, Aug 25, 2009 at 10:27 PM, Steve<[email protected]>  wrote:
>>
>> -------- Original Message --------
>>> Date: Tue, 25 Aug 2009 21:33:19 +0200
>>> From: Sven Karlsson<[email protected]>
>>> To: [email protected]
>>> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000
>
>>>          X-NS-Message-Id*BD74, 0.01000
>>>
>> Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely not
>> DSPAM 3.9.0. Right?
>
> No, 3.6.8 as you noted below. But isn't it strange that the most
> significant tokens are at 0.01, and it is still considered spam?
>
>>> Other strangeness: most factors displayed seem to be from the header,
>>> such as month*day pairs (although not in this example). I would assume
>>> that the email content would give a better indication of
>>> ham/spam.
>>>
>> That is certainly true, but you probably use one of the Bayesian algorithms,
>> and they only use the most significant tokens (15 tokens or more, but with
>> an upper limit). If you want all tokens to be considered, then you should
>> use naïve, as this would process all tokens.
>
>
> Ok.
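
To make the "most significant tokens" idea concrete, here is a minimal
sketch in Python. It is not DSPAM's actual code. It assumes that
"significant" means farthest from the neutral value 0.5 and uses a cutoff
of 15, as in Paul Graham's original description. The function name, the
cutoff and all sample tokens except the X-NS-Message-Id one are my own
illustration, and 3.6.8 may well select tokens differently.

```python
def most_significant(token_probs, limit=15):
    """Return up to `limit` (token, probability) pairs, farthest from 0.5 first.

    Assumption: "significance" is the distance of the token's spam
    probability from the neutral value 0.5.
    """
    ranked = sorted(
        token_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical factors; only the first token appears in the original report.
factors = {
    "X-NS-Message-Id*BD74": 0.01,
    "Date*Aug": 0.45,
    "body*hello": 0.50,
    "body*offer": 0.96,
}

for token, prob in most_significant(factors, limit=2):
    print(f"{token}, {prob:.5f}")
```

With a small cutoff, header tokens that sit at the extremes can crowd out
body tokens, which would explain why the factor list looks header-heavy.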
>
>>> Even more strangeness: The "improbability drive" shows "1 in 151
>>> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
>>> cases (of 2146 examined emails). I would expect a lot more variation
>>> here. Does this indicate a problem?
>>>
>> YES! Something is not right with the statistical counters. Is that issue 
>> only on your setup or do you have other users having the same issue?
>
> This was for all users.
>
>>
>>
>>> The setup scenario is for about 1000 mailboxes, using a global user,
>>> TOE training and initial corpus of about 5000 manually sorted
>>> spam/ham. There is a central periodic TOE training done about once a
>>> week for a sample of all messages, training the globaluser.
>>>
>> I don't understand this. What are you training once a week? New and fresh 
>> set of HAM/SPAM or the same manually sorted 5000 HAM/SPAM messages?
>
>
> New email; one admin goes through a global mailbox and retrains the
> obvious missed spam and ham. This means that not all FP/FN are
> retrained, but it should be OK since it's TOE training (even though
> some accuracy is lost). It also means that training may be focused on,
> for example, certain days of the week: the admin doing the training is
> more alert when starting on the Monday emails, but may stop training
> at the Wednesday emails, leaving Thursday-Sunday untrained. This may
> give an unfair balance, I assume.


Wow! I'm really surprised that dspam is doing as well as this, given the
sporadic training you're doing. How can you expect dspam to know what is
spam and what is ham if some of the errors go without retraining? It
sounds to me like you are really confusing the engine, since an
untrained error contributes to further errors.
>
>>
>>
>>> Algorithm graham burton
>>>
>> AHA! So there we are. That's the reason for the reduced number of tokens in
>> the show factors output. This is btw nothing bad. It is not necessary to
>> process all tokens to get a good result.
>
> Ok.
>
>>
>>
>>> PValue graham
>>>
>> Uhh... if you have that in PValue then this must be DSPAM 3.6.8 or less. Am 
>> I right?
>>
>>
>>> libmysql drv storage driver
>>>
>>> Using dspam 3.6.8 shipped with Debian.
>>>
>> Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your DSPAM
>> setup? To 3.8.0 at least. DSPAM 3.6.8 does not offer you much to improve
>> the situation you are currently facing.
>
> Can 3.8.0 be used in production? I was thinking of moving directly to
> 3.9.0, but I'm unsure about the stability.... Users are already
> calling and complaining about ham ending up in the spamboxes :)
>
>> Besides the 3.6.8 version of DSPAM? Not much (if anything). From what I see
>> above, you can't improve your situation much with 3.6.8.
>>
>>
>>> Any way to debug the factors/tokens?
>>>
>> Debug in what way?
>
> Such as why tokens with 0.01 probability end up as spam (or maybe I
> don't understand this correctly, but I've seen the v*gra tokens having
> like 0.96 probability, which is more understandable..).
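
For comparison, this is roughly what I would expect the arithmetic to look
like. It is a sketch that assumes the combining rule from Paul Graham's
"A Plan for Spam" (product of p divided by the sum of the product of p and
the product of 1 - p). The function name graham_combine is made up for this
example, and the real 3.6.8 implementation may differ in its details.

```python
import math


def graham_combine(probs):
    """Combine per-token spam probabilities with Graham's published rule.

    Assumption: every value lies strictly between 0 and 1, so the
    denominator can never be zero.
    """
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Fifteen factors all at 0.01, as in the reported output.
all_low = graham_combine([0.01] * 15)

# Fifteen factors at 0.96, like the v*gra tokens mentioned above.
all_high = graham_combine([0.96] * 15)

print(f"all factors at 0.01 -> {all_low:.3e}")
print(f"all factors at 0.96 -> {all_high:.6f}")
```

Under this rule a message whose factors are all at 0.01 comes out very close
to 0, i.e. ham. If such messages are still tagged as spam, the factors shown
and the counters behind the verdict do not seem to agree, which would point
at the statistics rather than at the tokens themselves.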
>
>
> Maybe there is some problem with the global group/user handling? (i.e.
> users are normally not training themselves.)
> Should retraining be done with dspam --user globaluser  or no user
> setting at all? (only using the uid in the signature).
>
> I have also tried to first do a reclassification with source=error,
> and also tried retraining them instead as corpus, after removing the
> previous dspam header and signature data. Maybe this has a negative
> impact on the statistics?
>
> BR,
>   Sven
>


_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
