Re: [Dspam-user] Increase Spam Hit Rate

Steve Fatula Thu, 19 Apr 2012 11:41:04 -0700

From: Stevan Bajić <ste...@bajic.ch>
>To: dspam-user@lists.sourceforge.net 
>Sent: Wednesday, April 18, 2012 6:37 PM
>Subject: Re: [Dspam-user] Increase Spam Hit Rate
> 
>
>If I get the exact same message, it sometimes still shows up as not spam, but, 
>mostly, it shows up as spam the next time. So, my answer is it learns a 
>specific message quickly.
>>
>>
Okay. This indicates that you have a large imbalance in your spam/ham ratio. 
Can you post the result of the following SQL query:
>select sum(spam_hits),sum(innocent_hits) from dspam_token_data where
    uid=<insert_here_your_own_uid>;
>
>
+----------------+--------------------+
| sum(spam_hits) | sum(innocent_hits) |
+----------------+--------------------+
|          70517 |            4995414 |
+----------------+--------------------+
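
To put those two totals in proportion, here is a minimal Python sketch. The numbers are copied from the query result above; the variable names are only illustrative and nothing here comes from DSPAM itself.

```python
# Totals copied from the query result above.
spam_hits = 70517
innocent_hits = 4995414

total_hits = spam_hits + innocent_hits

# Fraction of all token hits that were counted as spam.
spam_share = spam_hits / total_hits

# How many innocent hits exist for every single spam hit.
ham_per_spam = innocent_hits / spam_hits

print(f"spam share of all token hits: {spam_share:.2%}")
print(f"innocent hits per spam hit:   {ham_per_spam:.1f}")
```

The further the second figure is from 1, the more lopsided the token data is toward ham.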


>>>>
>>>>#
>>>># OnFail: What to do if local delivery or quarantine should fail. If set
>>>># to "unlearn", DSPAM will unlearn the message prior to exiting with an
>>>># unsuccessful return code. The default option, "error" will not unlearn
>>>># the message but return the appropriate error code. The unlearn option
>>>># is useful on some systems where local delivery failures will cause the
>>>># message to be requeued for delivery, and could result in the message
>>>># being processed multiple times. During a very large failure, however, 
>>>># this could cause a significant load increase.
>>>>#
>>>>OnFail unlearn
>>>>
>>>>
I would not unlearn the message on failures. Do you have any reason to set this 
to 'unlearn'?
>>>
>>>
The reason would be the reason given in the comments. i.e., it would be 
re-queued. 
>>
Aha. You got that wrong. The note says that it could be useful to set it to 
unlearn IF a local delivery failure is resulting in a requeue of the message. 
Is this the case for you? Does a failure in the local delivery (in your case a 
delivery to the dovecot LMTP service) result in a re-delivery/re-queue of the 
exact same message?
>
>Let me review the postfix config.... We are using the after queue content 
>filter technique (not the technique most people use most likely) to send the 
>email to dspam. dspam then sends the mail to dovecot via lmtp. If dspam cannot 
>for some reason send the mail to dovecot via lmtp, then, it would stay in the 
>postfix queue and eventually retry, thus sending to dspam again. I figure that 
>would qualify as being "re-queued". True?

>>>
>>>#
>>>># Features: Specify features to activate by default; can also be specified
>>>># on the commandline. See the documentation for a list of available features.
>>>># If _any_ features are specified on the commandline, these are ignored.
>>>>#
>>>>#Feature noise
>>>>Feature whitelist
>>>>
>>>>
I strongly advise you to enable 'noise' too.
>>>
>>>
>>>
>>>
Ok, I don't really seem to get the sort of spam it was meant for, but, wouldn't 
hurt.
>>
>>
I don't understand this answer. Can you rephrase it?
>
>The type of spam we seem to get never has what the documentation says noise 
>handles. So, we don't really get wordlist attack style spam messages, so, it 
>wasn't enabled. I just was saying I would enable it anyway in case we ever do.


>>>># Training Buffer: The training buffer waters down statistics during training.
>>>># It is designed to prevent false positives, but can also dramatically reduce
>>>># dspam's catch rate during initial training. This can be a number from 0
>>>># (no buffering) to 10 (maximum buffering). If you are paranoid about false
>>>># positives, you should probably enable this option.
>>>>#
>>>>Feature tb=3
>>>>
>>>>
Why do you set the training buffer to 3? Why not 5 (the default)? Or why not 
disable it?
>>>
>>>
Oh, I don't remember any more. It was a while ago. At the time, I believe not a 
single message was ever classified as SPAM, so, was experimenting during the 
training period.
This sounds strange to me. I mean the fact that not a single message was EVER 
classified as SPAM.
>
>Well, the question is when was the first spam detected, i.e., after how many 
>false negatives. That I do not recall, so, it may not be too strange. It's 
>just why I had changed the setting to see what impact it might have had.


>>>Ohhh boy! Where is that list from? Looks like one of my older IgnoreHeader 
>>>lists.
>>>
>>>
It's mostly someone's list, don't recall whose. It seemed like a very good idea 
when I tracked what dspam was doing with various messages, wasting time on 
headers with useless data in them. Is that not a good thing to ignore?
It is! Ignoring headers is good for accuracy but bad for speed: DSPAM has to 
compare every header against the list, always walking the whole IgnoreHeader 
list. We could make the code faster by using hash tables, but the C code today 
does not do that.
>
>Probably would be a good idea if it was handled as hashed tables. I don't see 
>why everyone wouldn't want to ignore most of the headers. Most all of them are 
>useless.
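
To illustrate the difference between walking the whole ignore list for every header and a hashed lookup, here is a rough Python sketch. It is not DSPAM's C code: the header names in the list and both function names are hypothetical, and the case-insensitive comparison is an assumption based on mail header names being case-insensitive.

```python
# Hypothetical ignore list; a real one would come from the IgnoreHeader
# lines in dspam.conf.
IGNORE_HEADERS = ["Received", "Message-ID", "Date", "X-Mailer"]


def is_ignored_linear(name, ignore_list):
    """Compare the header name against every entry in turn.

    The cost grows with the length of the ignore list.
    """
    wanted = name.lower()
    for entry in ignore_list:
        if entry.lower() == wanted:
            return True
    return False


# Build the hashed set once, lower-cased so lookups ignore case.
IGNORE_SET = frozenset(h.lower() for h in IGNORE_HEADERS)


def is_ignored_hashed(name, ignore_set=IGNORE_SET):
    """One hash lookup, regardless of how long the ignore list is."""
    return name.lower() in ignore_set


if __name__ == "__main__":
    for header in ("Received", "Subject", "x-mailer"):
        print(header, is_ignored_linear(header, IGNORE_HEADERS),
              is_ignored_hashed(header))
```

Both functions give the same answers; only the amount of work per header differs, which matters more as the ignore list grows.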

Ok, so, here's where it's unfamiliar to me. I have never used dspam_train (user 
re-training works via dovecot antispam plugin). The doc implies you want to use 
a corpus for this, I have none whose purpose was that. Certainly, I don't have 
all those messages that passed through dspam the first time. Probably, 90% of 
them are gone. I delete things.
>>>
>>
>>
>>So, is the thought here to take all messages I currently have in my inbox, 
>>trash, etc., that I know are not spam and, run them through dspam_train as 
>>nonspam, and, then take the one folder that is spam, and run them 
>>through dspam_train as spam, and simply go from there? (I understand the 
>>merged group portion I think, have never used one yet though). So, I could do 
>>this for all users who participated in training so that the one SpamHitRate 
>>user could benefit from all of those.
>>
>>
Spam corpora are very easy to get. Ham corpora are the problem. You can find ham 
corpora on the net, but usually it is better to use your own. What you could do 
is use the messages you (and your users) have sent. Don't use the inbox, because 
those messages have the X-DSPAM-... headers and you would need to clean them. 
Better to use the ones you find in the sent folder.
>
>I wondered about that (dspam tokens). However, I could just sed them out with 
>no problem. Make a copy of the inbox and sed out all of the dspam headers. No 
>problem there, right?

I take it if I want to use my SPAM folder (which has 30 days worth of trained 
spam), then, I would also have to remove those DSPAM headers too, right?
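
As an alternative to sed, something like the following Python sketch could do the header cleanup on a copy of each message. It assumes only that the headers to drop all start with "X-DSPAM-" (the prefix mentioned above); the function name is made up, and it uses nothing beyond the standard library email module.

```python
import email
from email import policy


def strip_dspam_headers(raw_bytes):
    """Return the message with every X-DSPAM-* header removed.

    raw_bytes is one complete message (headers plus body) as bytes.
    """
    # compat32 keeps the remaining headers close to how they arrived.
    msg = email.message_from_bytes(raw_bytes, policy=policy.compat32)

    # Copy the key list first, because headers are deleted while looping.
    for name in list(msg.keys()):
        if name.lower().startswith("x-dspam-"):
            # Deletes all occurrences; no error if already removed.
            del msg[name]

    return msg.as_bytes()
```

This works on one message at a time, so it would have to be applied to each message in the copied folder before feeding the copies to dspam_train.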

Thanks for all your time.

Steve


