Re: [Dspam-user] Increase Spam Hit Rate

Steve Fatula Wed, 18 Apr 2012 16:00:54 -0700

From: Stevan Bajić <[email protected]>
>To: [email protected] 
>Sent: Wednesday, April 18, 2012 3:41 PM
>Subject: Re: [Dspam-user] Increase Spam Hit Rate
> 
>
>This is not good. But the above data is not that horrible. Anyway... allow me 
>to ask you a bunch of questions:
>1) When you get a FN or a FP and then you retrain and later you get
    almost the same message again, does DSPAM classify it correctly?
    (aka: do you have the feeling DSPAM is learning quickly or rather
    slowly)
>
>If I get the exact same message, it sometimes still shows up as not spam, but, 
>mostly, it shows up as spam the next time. So, my answer is it learns a 
>specific message quickly.



2) Those ~ 6'000 processed messages from above are from an account that is how 
many days/months/years old?
>
>Well, I will assume this means while using dspam, obviously. I don't actually 
>recall. I would say possibly a year, as a wild guess.


>>
>>#
>># OnFail: What to do if local delivery or quarantine should fail. If set
>># to "unlearn", DSPAM will unlearn the message prior to exiting with an
>># unsuccessful return code. The default option, "error" will not unlearn
>># the message but return the appropriate error code. The unlearn option
>># is useful on some systems where local delivery failures will cause the
>># message to be requeued for delivery, and could result in the message
>># being processed multiple times. During a very large failure, however, 
>># this could cause a significant load increase.
>>#
>>OnFail unlearn
>>
>>
I would not unlearn the message on failures. Do you have any reason to set this 
to 'unlearn'?
>
>The reason would be the reason given in the comments. i.e., it would be 
>re-queued. 

>>
>>#
>># Training Mode: The default training mode to use for all operations, when
>># one has not been specified on the commandline or in the user's preferences.
>># Acceptable values are: 
>>#     toe     Train on Error (Only)
>>#     teft    Train Everything (Trains on every message)
>>#     tum     Train Until Mature (Train only tokens without enough data)
>>#     notrain Do not train or store signatures (large ISP systems, post-train)
>>#
>>TrainingMode teft
>>
>>
OUCH! I really, really, really would do TOE here. TEFT is so 'old school' and 
really does more harm than good.
>
>
>
>That's enough really's for me! 

>
>#
>># Features: Specify features to activate by default; can also be specified
>># on the commandline. See the documentation for a list of available features.
>># If _any_ features are specified on the commandline, these are ignored.
>>#
>>#Feature noise
>>Feature whitelist
>>
>>
I strongly advise you to enable 'noise' too.
>
>
>
>Ok, I don't really seem to get the sort of spam it was meant for, but, 
>wouldn't hurt.



>
>
># Training Buffer: The training buffer waters down statistics during training.
>># It is designed to prevent false positives, but can also dramatically reduce
>># dspam's catch rate during initial training. This can be a number from 0
>># (no buffering) to 10 (maximum buffering). If you are paranoid about false
>># positives, you should probably enable this option.
>>#
>>Feature tb=3
>>
>>
Why do you set the training buffer to 3? Why not 5 (the default)? Or why 
not disable it?
>
>Oh, I don't remember any more. It was a while ago. At the time, I believe not 
>a single message was ever classified as SPAM, so, was experimenting during the 
>training period.

>>
>>#
>># Preferences: Specify any preferences to set by default, unless otherwise
>># overridden by the user (see next section) or a default.prefs file.
>># If user or default.prefs are found, the user's preferences will override any
>># defaults.
>>#
>>Preference "trainingMode=TEFT"# { TOE | TUM | TEFT | NOTRAIN } -> default:teft
Bad (IMHO). Set this to TOE.
>
>Ok
Preference "enableBNR=off"# { on | off } -> default:off
I would enable BNR. This helps a lot.
>
>Not a problem

>>
>># If you're running DSPAM in client/server (daemon) mode, uncomment the
>># setting below to override the default connection cache size (the number
>># of connections the server pools between all clients). The connection cache
>># represents the maximum number of database connections *available* and should
>># be set based on the maximum number of concurrent connections you're likely
>># to have. Each connection may be used by only one thread at a time, so all
>># other threads _will block_ until another connection becomes available.
>>#
>>MySQLConnectionCache25
>>
>>
There is a space missing before the '25'.
>
>No, it's there in the file, it was a tab character I think


>Ohhh boy! Where is that list from? Looks like one of my older
    IgnoreHeader lists.
>
>It's mostly someone's list, don't recall whose. It seemed like a very good idea 
>when I tracked what dspam was doing with various messages, wasting time on 
>headers with useless data in them. Is that not a good thing to ignore? I also 
>have added a few headers on my own to messages, such as the geo id and a few 
>others I already have when pre-processing the message anyway. I presumed this 
>would help with countries that seem to be mostly spammy.


>
>Okay. In short: The config is okay. I would mainly move away from
    TEFT. It is pure evil. While it might deliver results quickly in
    the beginning, it will bite you later, and more so the older the data
    in the storage backend gets. TOE is way better for you.
>
>My advice would be (the order is important):
>
>       * Switch to TOE
>       * Enable 'noise' and 'BNR'
>       * Create new user (let's call that user SpamHitRate)
>       * Disable whitelisting and other mumbo jumbo for that user
>       * Train the user with dspam_train
>       * Remove ALL TOKENS and STATISTICS for ALL USERS except for the user 
> SpamHitRate
>       * Use the user SpamHitRate as a global merged group
>Tell all users on your system that you fine-tuned the anti-spam
    system, that they should expect the filter to make errors, and
    that you expect them to correct those errors by training.
    The good thing is that if they don't train / retrain the system,
    they will not destroy their accuracy as fast with TOE as they do with
    TEFT. The other good thing is that you can take all the time in the
    world to train that 'SpamHitRate' user, do whatever is needed to
    get a good catch rate for that user, then convert it to a merged
    global group and let all your users profit instantly from that
    training. On the one hand this will drastically reduce downtime of the
    anti-spam filter, and on the other hand it will increase the catch
    rate instantly.
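
A minimal sketch of the dspam.conf side of the advice above (switch to TOE, enable 'noise' and BNR). It uses only directives and values quoted earlier in this thread; the comments are illustrative, and the per-user steps for SpamHitRate (disabling whitelisting, training, the merged group) are not covered here.

```conf
# Train on errors only, instead of on every message
TrainingMode toe

# 'noise' uncommented, whitelist left as it was
Feature noise
Feature whitelist

Preference "trainingMode=TOE"	# { TOE | TUM | TEFT | NOTRAIN }
Preference "enableBNR=on"	# { on | off }
```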
>
>Ok, so, here's where it's unfamiliar to me. I have never used dspam_train 
>(user re-training works via dovecot antispam plugin). The doc implies you want 
>to use a corpus for this, I have none whose purpose was that. Certainly, I don't 
>have all those messages that passed through dspam the first time. Probably, 
>90% of them are gone. I delete things.

So, is the thought here to take all messages I currently have in my inbox, 
trash, etc., that I know are not spam and run them through dspam_train as 
nonspam, then take the one folder that is spam and run those messages 
through dspam_train as spam, and simply go from there? (I understand the merged 
group portion, I think, though I have never used one.) I could do this for 
all users who participated in training, so that the one SpamHitRate user could 
benefit from all of those.

As far as removing all tokens and stats for all but one user, there is no 
utility for this is there? Or, do I simply craft a bunch of MySQL statements?
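
If it does come down to hand-written MySQL statements, one cautious way is to generate them and read them before running anything. The sketch below only prints DELETE statements; it does not connect to a database. Assumptions to verify against your own schema first: the table names `dspam_token_data` and `dspam_stats`, the `uid` column, and the uid value 1001 for the SpamHitRate user are all illustrative, not taken from this thread.

```python
# ASSUMPTION: per-user tokens and statistics live in these tables, keyed by
# a numeric `uid` column. Check with SHOW TABLES / DESCRIBE before use.
PER_USER_TABLES = ("dspam_token_data", "dspam_stats")


def cleanup_statements(keep_uid, tables=PER_USER_TABLES):
    """Return DELETE statements that remove rows for every uid except keep_uid."""
    keep = int(keep_uid)  # force an integer so nothing else ends up in the SQL text
    return ["DELETE FROM {} WHERE uid <> {};".format(table, keep) for table in tables]


if __name__ == "__main__":
    # 1001 is a hypothetical uid for the SpamHitRate user.
    for statement in cleanup_statements(1001):
        print(statement)
```

Printing rather than executing leaves room to take a database dump and review the statements before feeding them to the mysql client.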



>If you want, I can help you get where you want to be by
    sending you more info on how to set up that new system. You do, however,
    really need to delete your old data. It looks like your old data is not
    good enough. And we have made DSPAM so much better that erasing
    token data and starting from scratch produces very good results
    very fast. In the past you had to wait days/weeks until DSPAM
    caught up, but today this is no longer the case.
>
>
>Well, if merged groups are documented somewhere pretty well, I am sure I can 
>figure it out. If you have something specific, feel free to send it my way.


_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
