On 19.04.2012 00:54, Steve Fatula wrote:

    *From:* Stevan Bajić <ste...@bajic.ch>
    *To:* dspam-user@lists.sourceforge.net
    *Sent:* Wednesday, April 18, 2012 3:41 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    This is not good. But the above data is not that horrible.
    Anyway... allow me to ask you a bunch of questions:
    1) When you get a FN or a FP and then you retrain and later you
    get almost the same message again, does DSPAM classify it
    correctly? (aka: do you have the feeling DSPAM is learning quickly
    or rather slowly)

If I get the exact same message, it sometimes still shows up as not spam, but mostly it shows up as spam the next time. So, my answer is that it learns a specific message quickly.

Okay. This indicates that you have a high imbalance in your spam/ham ratio. Can you post the result of the following SQL query: select sum(spam_hits),sum(innocent_hits) from dspam_token_data where uid=<insert_here_your_own_uid>;
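
For reading the result of that query, here is a minimal Python sketch that turns the two sums into a spam share. It is not part of DSPAM; the function name and the sample numbers are hypothetical.

```python
def spam_share(spam_hits, innocent_hits):
    """Return the fraction of all token hits that are spam hits.

    Returns None when there are no hits at all, to avoid dividing by zero.
    """
    total = spam_hits + innocent_hits
    if total == 0:
        return None
    return spam_hits / total


# Hypothetical sums, standing in for what the query might return.
share = spam_share(spam_hits=900_000, innocent_hits=100_000)
print(f"spam share of token hits: {share:.0%}")
```

A share far away from one half in either direction would be one sign of the imbalance asked about above.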


    2) Those ~ 6'000 processed messages from above are from an account
    that is how many days/months/years old?

Well, I will assume this means while using dspam, obviously. I don't actually recall. I would say possibly a year, as a wild guess.

Okay.



    #
    # OnFail: What to do if local delivery or quarantine should fail. If set
    # to "unlearn", DSPAM will unlearn the message prior to exiting with an
    # unsuccessful return code. The default option, "error" will not unlearn
    # the message but return the appropriate error code. The unlearn option
    # is useful on some systems where local delivery failures will cause the
    # message to be requeued for delivery, and could result in the message
    # being processed multiple times. During a very large failure, however,
    # this could cause a significant load increase.
    #
    OnFail unlearn

    I would not unlearn the message on failures. Do you have any
    reason to set this to 'unlearn'?

The reason would be the reason given in the comments, i.e., it would be re-queued.

Aha. You got that wrong. The note says that it could be useful to set it to unlearn IF a local delivery failure results in a requeue of the message. Is this the case for you? Does a failure in the local delivery (in your case a delivery to the dovecot LMTP service) result in a re-delivery/re-queue of the exact same message?



    #
    # Training Mode: The default training mode to use for all operations, when
    # one has not been specified on the commandline or in the user's preferences.
    # Acceptable values are:
    #     toe     Train on Error (Only)
    #     teft    Train Everything (Trains on every message)
    #     tum     Train Until Mature (Train only tokens without enough data)
    #     notrain Do not train or store signatures (large ISP systems, post-train)
    #
    TrainingMode teft

    OUCH! I really, really, really would do TOE here. TEFT is so 'old
    school' and really does more harm than good.


That's enough really's for me!

LOL. Sorry. I tried to put more weight on my answer.



    #
    # Features: Specify features to activate by default; can also be specified
    # on the commandline. See the documentation for a list of available features.
    # If _any_ features are specified on the commandline, these are ignored.
    #
    #Feature noise
    Feature whitelist

    I strongly advise you to enable 'noise' too.


Ok, I don't really seem to get the sort of spam it was meant for, but, wouldn't hurt.

I don't understand this answer. Can you rephrase it?



    # Training Buffer: The training buffer waters down statistics during training.
    # It is designed to prevent false positives, but can also dramatically reduce
    # dspam's catch rate during initial training. This can be a number from 0
    # (no buffering) to 10 (maximum buffering). If you are paranoid about false
    # positives, you should probably enable this option.
    #
    Feature tb=3

    Why do you set the training buffer to 3? Why not 5 (the
    default)? Or why not disable it?

Oh, I don't remember any more. It was a while ago. At the time, I believe not a single message was ever classified as SPAM, so I was experimenting during the training period.

This sounds strange to me. I mean the fact that not a single message was EVER classified as SPAM.



    #
    # Preferences: Specify any preferences to set by default, unless otherwise
    # overridden by the user (see next section) or a default.prefs file.
    # If user or default.prefs are found, the user's preferences will override any
    # defaults.
    #
    Preference "trainingMode=TEFT" # { TOE | TUM | TEFT | NOTRAIN } -> default:teft
    Bad (IMHO). Set this to TOE.

Ok

    Preference "enableBNR=off" # { on | off } -> default:off
    I would enable BNR. This helps a lot.

Not a problem


    # If you're running DSPAM in client/server (daemon) mode, uncomment the
    # setting below to override the default connection cache size (the number
    # of connections the server pools between all clients). The connection cache
    # represents the maximum number of database connections *available* and should
    # be set based on the maximum number of concurrent connections you're likely
    # to have. Each connection may be used by only one thread at a time, so all
    # other threads _will block_ until another connection becomes available.
    #
    MySQLConnectionCache25

    There is a space missing before the '25'.

No, it's there in the file, it was a tab character I think

You were right. On my screen it was so close that I did not see the space. But now I see it.



    Ohhh boy! Where is that list from? It looks like one of my older
    IgnoreHeader lists.

It's mostly someone's list, I don't recall whose. It seemed like a very good idea when I tracked what dspam was doing with various messages, wasting time on headers with useless data in them. Is that not a good thing to ignore?

It is! Ignoring headers is good for accuracy but bad for speed: DSPAM has to compare every header against the list, always processing the whole IgnoreHeader list. We could make the code faster by using hash tables, but the C code today does not do that.


I also have added a few headers on my own to messages, such as the geo id and a few others I already have when pre-processing the message anyway. I presumed this would help with countries that seem to be mostly spammy.



    Okay. In short: The config is okay. I would mainly go away from
    TEFT. It is pure evil. While it might deliver results quickly in
    the beginning, it will bite you later, and the more so the older
    the data in the storage backend gets. TOE is way better for you.

    My advice would be (the order is important):

      * Switch to TOE
      * Enable 'noise' and 'BNR'
      * Create a new user (let's call that user SpamHitRate)
      * Disable whitelisting and other mumbo jumbo for that user
      * Train the user with dspam_train
      * Remove ALL TOKENS and STATISTICS for ALL USERS except for the
        user SpamHitRate
      * Use the user SpamHitRate as a global merged group


    Tell all users on your system that you fine-tuned the anti-spam
    system, that they should expect the filter to make errors, and
    that you expect them to correct those errors by training. The
    good thing is that if they don't train / retrain the system, they
    will not destroy their accuracy as fast with TOE as they do with
    TEFT. The other good thing is that you can take all the time in
    the world to train that 'SpamHitRate' user, do all that is needed
    to get a good catch rate for that user, and then convert it to a
    merged global group and let all your users profit from that
    training instantly. On one hand this will drastically reduce
    downtime of the anti-spam filter and on the other hand it will
    increase the catch rate instantly.

Ok, so, here's where it's unfamiliar to me. I have never used dspam_train (user re-training works via the dovecot antispam plugin). The doc implies you want to use a corpus for this; I have none whose purpose was that. Certainly, I don't have all those messages that passed through dspam the first time. Probably, 90% of them are gone. I delete things.

So, is the thought here to take all messages I currently have in my inbox, trash, etc., that I know are not spam and, run them through dspam_train as nonspam, and, then take the one folder that is spam, and run them through dspam_train as spam, and simply go from there? (I understand the merged group portion I think, have never used one yet though). So, I could do this for all users who participated in training so that the one SpamHitRate user could benefit from all of those.

Spam corpora are very easy to get. Ham corpora are a problem. You can find ham corpora on the net, but usually it would be better to use your own. What you could do is use the messages you (and your users) have sent. Don't use the inbox, because those messages have the X-DSPAM-... headers and you would need to clean them. Better to use the ones you find in the sent folder.
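
If inbox messages have to be used after all, the X-DSPAM-... headers mentioned above could be stripped first. A minimal Python sketch using only the standard library follows; the function name and the sample message are hypothetical, and it only removes headers whose name starts with X-DSPAM-, leaving the body untouched.

```python
from email import message_from_bytes


def strip_dspam_headers(raw: bytes) -> bytes:
    """Return the message with every X-DSPAM-* header removed."""
    msg = message_from_bytes(raw)
    # Copy the key list first: deleting while iterating over the live
    # header list would skip entries.
    for name in list(msg.keys()):
        if name.lower().startswith("x-dspam-"):
            # Deleting by name removes all headers with that name.
            del msg[name]
    return msg.as_bytes()


# Hypothetical sample message as it might sit in an inbox.
sample = (
    b"From: alice@example.com\n"
    b"X-DSPAM-Result: Innocent\n"
    b"X-DSPAM-Signature: abc123\n"
    b"Subject: lunch\n"
    b"\n"
    b"See you at noon.\n"
)
cleaned = strip_dspam_headers(sample)
print(cleaned.decode("ascii"))
```

Run over a copy of a mail folder, this would give cleaner ham candidates, although the sent folder remains the simpler source.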

As far as removing all tokens and stats for all but one user, there is no utility for this is there? Or, do I simply craft a bunch of MySQL statements?

No tool. Simply craft a SQL command.
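
A rough sketch of what such a command could look like, shown in Python against an in-memory SQLite database as a stand-in for the real MySQL backend. Assumptions: dspam_token_data and its uid column are taken from the query earlier in this thread; the dspam_stats table is assumed to hold the per-user statistics; the uid value 1 for the SpamHitRate user and all sample rows are hypothetical. On a real system the equivalent DELETE statements would be run against MySQL, after taking a backup.

```python
import sqlite3


def purge_other_users(conn, keep_uid, tables=("dspam_token_data", "dspam_stats")):
    """Delete the rows of every uid except keep_uid.

    Returns a dict mapping table name to the number of rows removed.
    """
    removed = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur = conn.execute(f"DELETE FROM {table} WHERE uid <> ?", (keep_uid,))
        removed[table] = cur.rowcount
    conn.commit()
    return removed


# Toy stand-in for the real database (hypothetical uids and counts).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dspam_token_data "
    "(uid INTEGER, token INTEGER, spam_hits INTEGER, innocent_hits INTEGER)"
)
conn.execute(
    "CREATE TABLE dspam_stats "
    "(uid INTEGER, spam_learned INTEGER, innocent_learned INTEGER)"
)
conn.executemany(
    "INSERT INTO dspam_token_data VALUES (?, ?, ?, ?)",
    [(1, 111, 5, 0), (2, 111, 1, 9), (3, 222, 0, 4)],
)
conn.executemany(
    "INSERT INTO dspam_stats VALUES (?, ?, ?)",
    [(1, 50, 40), (2, 10, 90), (3, 0, 12)],
)

# Keep only the data of the (hypothetical) SpamHitRate uid 1.
removed = purge_other_users(conn, keep_uid=1)
print(removed)
```

The uid of the SpamHitRate user would have to be looked up on the real system first; the value here is only a placeholder.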


    If you want, I can help you get where you want to be by sending
    you more info on how to set up that new system. You do, however,
    really need to delete your old data. It looks like your old data
    is not good enough. And we have made DSPAM so much better that
    erasing token data and starting from scratch produces very good
    results very fast. In the past you had to wait days/weeks until
    DSPAM caught up, but today this is no longer the case.

Well, if merged groups are documented somewhere pretty well, I am sure I can figure it out. If you have something specific, feel free to send it my way.

I don't know if anyone has added something to the wiki about group support. From the past I know that people often complain about the documentation. I personally find things to be well documented (not excellently or super stellar well). But that is just me, and if everyone out there were happy with the documentation then we would not have so many complaints about it. So I guess it is not that well documented.

Before I start. Have you read about group support in DSPAM? If not then read this here -> http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=README;hb=HEAD#l1363


--
Kind Regards from Switzerland,

Stevan Bajić
