On 19.04.2012 20:34, Steve Fatula wrote:

    *From:* Stevan Bajić <ste...@bajic.ch>
    *To:* dspam-user@lists.sourceforge.net
    *Sent:* Wednesday, April 18, 2012 6:37 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    If I get the exact same message, it sometimes still shows up as
    not spam, but mostly it shows up as spam the next time. So my
    answer is: it learns a specific message quickly.

    Okay. This indicates that you have a high imbalance in your
    spam/ham ratio. Can you post the result of the following SQL query:
    select sum(spam_hits),sum(innocent_hits) from dspam_token_data
    where uid=<insert_here_your_own_uid>;

+----------------+--------------------+
| sum(spam_hits) | sum(innocent_hits) |
+----------------+--------------------+
|          70517 |            4995414 |
+----------------+--------------------+

Spam: 1.39 %
Ham: 98.61 %

This is crazy. You have about 70 times more Ham tokens in the database than Spam tokens. Your tokens are totally unbalanced.
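For reference, the percentages and the "70 times" figure follow directly from the two sums returned by the query (a quick sketch, using the numbers from the table above):

```shell
# Recompute the spam/ham balance from the two sums in the query result.
spam=70517
ham=4995414
awk -v s="$spam" -v h="$ham" 'BEGIN {
    t = s + h
    printf "Spam: %.2f %%\n", 100 * s / t
    printf "Ham:  %.2f %%\n", 100 * h / t
    printf "Ham/Spam ratio: %.1f\n", h / s
}'
```

which prints the ~1.4 % / ~98.6 % split and the roughly 70:1 ratio mentioned above.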

Running the same query against my uid on my system:
+----------------+--------------------+
| sum(spam_hits) | sum(innocent_hits) |
+----------------+--------------------+
|        3101346 |           16944007 |
+----------------+--------------------+

And doing the same query but this time using only the data from my merged group:
+----------------+--------------------+
| sum(spam_hits) | sum(innocent_hits) |
+----------------+--------------------+
|      347984546 |          329350014 |
+----------------+--------------------+

On the merged group I have a well-balanced ratio between Spam and Ham. On my own data I have about 5.4 times more Ham tokens than Spam tokens. But yours is 70 times more, and if I merge my tokens with the merged group (which is done at run time anyway), then the difference is not that big.

Merged:
+----------------+--------------------+
| sum(spam_hits) | sum(innocent_hits) |
+----------------+--------------------+
|      351085892 |          346294021 |
+----------------+--------------------+




        #
        # OnFail: What to do if local delivery or quarantine should fail. If set
        # to "unlearn", DSPAM will unlearn the message prior to exiting with an
        # unsuccessful return code. The default option, "error", will not unlearn
        # the message but return the appropriate error code. The unlearn option
        # is useful on some systems where local delivery failures will cause the
        # message to be requeued for delivery, and could result in the message
        # being processed multiple times. During a very large failure, however,
        # this could cause a significant load increase.
        #
        OnFail unlearn

        I would not unlearn the message on failures. Do you have any
        reason to set this to 'unlearn'?

    The reason would be the reason given in the comments, i.e., it
    would be re-queued.

    Aha. You got that wrong. The note says that it could be useful to
    set it to unlearn IF a local delivery failure results in a
    requeue of the message. Is this the case for you? Does a failure
    in the local delivery (in your case a delivery to the Dovecot LMTP
    service) result in a re-delivery/re-queue of the exact same message?

Let me review the postfix config... We are using the after-queue content filter technique (not the technique most people use, most likely) to send the email to dspam. dspam then sends the mail to dovecot via LMTP. If dspam cannot for some reason send the mail to dovecot via LMTP, then it would stay in the postfix queue and eventually be retried, thus being sent to dspam again. I figure that would qualify as being "re-queued". True?

Yes.
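For context, that after-queue arrangement is roughly the following (a sketch only; the transport name, paths, and flags here are illustrative assumptions and should be checked against the Postfix and dspam documentation for your versions):

```
# Postfix main.cf: run every queued message through the dspam filter
content_filter = dspam

# Postfix master.cf: pipe the message to the dspam binary
dspam     unix  -       n       n       -       10      pipe
    flags=Ru user=dspam argv=/usr/bin/dspam --deliver=innocent,spam
    --user $recipient -i -f $sender -- $recipient

# dspam.conf: have dspam deliver the classified mail to Dovecot via LMTP
DeliveryHost    /var/run/dovecot/lmtp
DeliveryProto   LMTP
```

If dspam exits with a temporary failure because Dovecot's LMTP service is unreachable, Postfix keeps the message in its queue and retries later, which is exactly the requeue scenario the OnFail comment describes.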


        #
        # Features: Specify features to activate by default; can also be specified
        # on the commandline. See the documentation for a list of available features.
        # If _any_ features are specified on the commandline, these are ignored.
        #
        #Feature noise
        Feature whitelist

        I strongly advise you to enable 'noise' too.


    Ok, I don't really seem to get the sort of spam it was meant for,
    but it wouldn't hurt.

    I don't understand this answer. Can you rephrase it?

The type of spam we get never has what the documentation says 'noise' handles. We don't really get wordlist-attack-style spam messages, so it wasn't enabled. I was just saying I would enable it anyway in case we ever do.

        #
        # Training Buffer: The training buffer waters down statistics during training.
        # It is designed to prevent false positives, but can also dramatically reduce
        # dspam's catch rate during initial training. This can be a number from 0
        # (no buffering) to 10 (maximum buffering). If you are paranoid about false
        # positives, you should probably enable this option.
        #
        Feature tb=3

        Why do you set the training buffer to 3? Why not 5 (the
        default)? Or why not disable it?

    Oh, I don't remember any more. It was a while ago. At the time, I
    believe not a single message was ever classified as SPAM, so I was
    experimenting during the training period.

    This sounds strange to me. I mean the fact that not a single
    message was EVER classified as SPAM.

Well, the question is when the first spam was detected, i.e., after how many false negatives. That I do not recall, so it may not be too strange. It's just why I had changed the setting, to see what impact it might have.

Did you run with OSB from the beginning, or did you use another tokenizer before?



        Ohhh boy! From where is that list? Looks like one of my older
        IgnoreHeader list.

    It's mostly someone's list; I don't recall whose. It seemed like a
    very good idea when I tracked what dspam was doing with various
    messages, wasting time on headers with useless data in them. Is
    that not a good thing to ignore?

    It is! Ignoring headers is good for accuracy but bad for speed:
    DSPAM needs to compare every header against the list, always
    processing the whole IgnoreHeader list. We could make the code
    faster by using hash tables, but the C code today does not do that.

It probably would be a good idea if it was handled with hash tables. I don't see why everyone wouldn't want to ignore most of the headers. Almost all of them are useless.

As usual: people think the more data there is to process, the better the engine can decide which class a message belongs to. I read a document in the past that explicitly mentioned that ignoring certain headers (like the date, message ID, etc.) would slightly (by a single-digit percentage) increase the accuracy.

        Ok, so here's where it's unfamiliar to me. I have never used
        dspam_train (user re-training works via the dovecot antispam
        plugin). The doc implies you want to use a corpus for this; I
        have none whose purpose was that. Certainly, I don't have all
        those messages that passed through dspam the first time.
        Probably 90% of them are gone. I delete things.


    So, is the thought here to take all messages I currently have in
    my inbox, trash, etc., that I know are not spam, run them through
    dspam_train as nonspam, then take the one folder that is spam,
    run them through dspam_train as spam, and simply go from there?
    (I think I understand the merged group portion, though I have
    never used one yet.) I could do this for all users who
    participated in training, so that the one user with the low spam
    hit rate could benefit from all of those.

    Spam corpora are ultra easy to get. Ham corpora are a problem. You
    can find ham corpora on the net, but usually it is better to use
    your own. What you could do is use the messages you (and your
    users) have sent. Don't use the inbox, because those messages have
    the X-DSPAM-... headers and you would need to clean them. Better to
    use the ones you find in the Sent folder.

I wondered about that (the DSPAM headers). However, I could just sed them out with no problem: make a copy of the inbox and sed out all of the dspam headers. No problem there, right?

Right. In the header they are easy to sed out. In the body it is slightly more challenging.
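A quick sketch of what that header cleanup could look like (hypothetical filenames; it assumes the DSPAM headers sit in the header section and also drops folded continuation lines, which a plain `sed '/^X-DSPAM-/d'` would miss):

```shell
# Hypothetical sample message with a folded (multi-line) X-DSPAM header.
printf 'From: user@example.com\nX-DSPAM-Signature: 4f8a,deadbeef\n\tmore-signature-data\nSubject: hello\n\nmessage body\n' > message.eml

# Drop X-DSPAM-* headers plus their folded continuation lines, but only
# in the header section (everything before the first blank line).
awk 'inbody           { print; next }
     /^$/             { inbody = 1; print; next }
     /^X-DSPAM-/      { skip = 1; next }        # start of a DSPAM header
     /^[ \t]/ && skip { next }                  # folded continuation line
                      { skip = 0; print }' message.eml > message.clean.eml
```

The same idea applies to a whole inbox copy: run each message through the filter before feeding it to dspam_train.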


I take it if I want to use my SPAM folder (which has 30 days' worth of trained spam), then I would also have to remove those DSPAM headers too, right?

Right. But judging from your low Spam token count above, I am not sure how much that would be. Do you have a lot of Spam messages? If you want, I can easily provide you with good-quality Spam corpora with many (thousands of) Spam messages. For example: one Spam corpus creator that I often use to check how well my DSPAM is working has 414'731 Spam messages in his 2011 corpus, 38'807 for January 2012, 48'237 for February 2012, 49'178 for March 2012, and so far 20'705 for April 2012. Training with his data from 2011 alone plus the available data from 2012 would give you over half a million spam mails to train with. Maybe I am wrong, but I would be surprised if you have that much data available to train with.

Or, if you want, you could send me your Ham corpus and I can train it with my own training method(*), create a merged group for you, export the SQL data, and send it to you.

*) Search the mailing list for keywords like 'TONE', 'asymmetric thickness', 'double sided' if you want to find out about how I train.

Thanks for all your time.

Steve


--
Kind Regards from Switzerland,

Stevan Bajić

