Re: [Dspam-user] Increase Spam Hit Rate

Stevan Bajić Fri, 20 Apr 2012 00:50:38 -0700

On 20.04.2012 07:32, Steve Fatula wrote:


    *From:* Stevan Bajić <[email protected]>
    *To:* "[email protected]"
    <[email protected]>
    *Sent:* Thursday, April 19, 2012 5:49 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    Spam: 1.391985007296783157922995792876 %
    Ham: 98.608014992703216842077004207124 %

    This is crazy. You have about 70 times more Ham tokens in the
    database than Spam tokens. Your tokens are totally unbalanced.

But, that's a fact, right?

Yes.

It just means not that much spam, and, lots of hammy words, right?

Yes.

it does not indicate much wrong. I do classify every single spam.

It does not indicate any wrongdoing on your part. But considering your statement that spam is not well captured, and looking at the number of Spam/Ham tokens, one can say that this is a strong indication that the chosen training mode is not good for you.
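
As a quick check of the "about 70 times" figure quoted above, the two token percentages can simply be divided. This is only a sketch of the arithmetic; the variable names are made up for illustration and nothing here is DSPAM output.

```python
# Token shares as quoted above (rounded to 15 decimals for readability).
spam_pct = 1.391985007296783
ham_pct = 98.608014992703217

# How many Ham tokens exist per Spam token.
ratio = ham_pct / spam_pct
print(f"Ham/Spam token ratio: {ratio:.1f}")
```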

            Why do you set the training buffer to 3? Why not 5
            (the default)? Or why not disable it?

        Oh, I don't remember any more. It was a while ago. At the
        time, I believe not a single message was ever classified as
        SPAM, so, was experimenting during the training period.
        This sounds strange to me. I mean the fact that not a single
        message was EVER classified as SPAM.

    Well, the question is when was the first spam detected, i.e.,
    after how many false negatives. That I do not recall, so, it may
    not be too strange. It's just why I had changed the setting to
    see what impact it might have had.
    Did you run from the beginning with OSB or had you another
    tokenizer before?
From the very first email, osb. But I also report to (gasp) spamcop, and, it does seem to remove spammers many times, but not always of course. For a couple cases, I even went to the pain of reporting to the data center and they did take care of it. So, done pretty well on the controlling front. And, like another suggested who said they don't even use dspam much, we do have a lot of anti-spam technology on the postfix end.

I have a lot of anti-spam technology running before DSPAM. Usually my inbound Spam is somewhere between 3% and 6% (this is not only for me; it is the whole email flow for all the domains combined, including Spam honey pots). Some domains could easily live without DSPAM because all their Spam is already blocked before processing reaches DSPAM.

    Probably would be a good idea if it was handled as hashed tables.
    I don't see why everyone wouldn't want to ignore most of the
    headers. Most all of them are useless.

    As usual: People think the more data there is to process the
    better the engine can decide which class a message is.
    In the past I read a document that explicitly mentioned that
    ignoring certain headers (like the date, message id, etc) would
    slightly increase the accuracy (by a single-digit percentage).

I don't see how having useless data = better results (not you, "people"). I would think it would be more likely to give incorrect results, I am surprised by 1%.

Ach. This is a phenomenon that is hard to describe. I see it all the time. I remember, already decades ago when doing my first projects, things like this:

IT guy: What data do we need to store?
Customer: Just ABC.
IT guy: Okay. We are collecting ABC but we as well have XYZ.
Customer: Good. I just need ABC.
IT guy: We are going to store ABC and XYZ.
Customer: Why?
IT guy: Well.... we got that data when clients are using our application.
Customer: And?

IT guy: And why not store XYZ in the database as well? It's cheap to store and you might one day need XYZ.
Customer: Okay. I just need ABC, and as long as it does not disturb speed or anything else, you can store XYZ.


IT people tend to stockpile data.

    Right. But judging from your low Spam token count above I am not
    sure how much that would be. Do you have a lot of Spam messages?
    If you want I can easily provide you good quality Spam corpora with
    many (thousands of) Spam messages. For example: One Spam corpus
    creator that I often use to check how well my DSPAM is working has
    414'731 Spam messages in his 2011 corpus, 38'807 for January 2012,
    48'237 for February 2012, 49'178 for March 2012 and so far 20'705
    for April 2012. Training with his data from 2011 and the
    available data from 2012 alone would give you over half a million
    spam mails to train with. Maybe I am wrong but I would be
    surprised if you have that much data available to train with.
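
Adding up the per-period counts quoted above confirms the "over half a million" figure. A small sketch of the sum; the dictionary keys are just labels chosen here for illustration.

```python
# Spam message counts per corpus period, as quoted above.
corpus_counts = {
    "2011": 414_731,
    "2012-01": 38_807,
    "2012-02": 48_237,
    "2012-03": 49_178,
    "2012-04 (so far)": 20_705,
}

# Total number of spam messages available for training.
total = sum(corpus_counts.values())
print(f"Total spam messages: {total}")  # 571658
```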
That would be the understatement of the century. I have all my spams going back to Jan 1 as it turns out. That would be all of 80 messages in my case, not quite adding up to hundreds of thousands, almost. ;-) Others on the system likely have more, but, I do not. Almost all of it is yahoo, hotmail spam.
I would *love* to have the corpus for 2011 and 2012. PLEASE, send them my way. That would likely benefit the others. The question of course is how. Can I download them from somewhere? Sounds too big to email. I can give you a FTP site if you do not have a place to put them.

I will send you a mail and you can download them yourself.

If you really want to benefit others with it then I strongly urge you to make a merged group. That will allow you to prepare stuff in advance without disturbing anyone, and then when you are finished you just turn on the merged group and out of nowhere all your users will have an increase in accuracy.

    Or if you want you could send me your Ham corpus and I can train
    with my own training method(*) and create a merged group for
    you, export the SQL data and send it to you.
If you give me the SPAM corpus, I can just run dspam_train on it (and I'd even add my 80). But it will be pretty unbalanced since I have few HAM messages since I only keep a month (maybe a few thousand messages). I am not sure that matters much? In the end, won't the detection still work, maybe biased towards SPAM at first, but, surely, it wouldn't take too long to stop false positives?

You are right. If you run TOE then an unbalanced training corpus is not such a big problem, since DSPAM will only learn on error.
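
To illustrate why learning only on error softens the effect of an unbalanced corpus, here is a minimal toy sketch. It assumes a made-up token-count classifier (`ToyFilter`) and a made-up corpus; it is not DSPAM's algorithm or API, only the train-on-error idea in isolation.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical token-count classifier, NOT DSPAM's actual engine."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def classify(self, text):
        tokens = text.lower().split()
        spam_hits = sum(self.spam[t] for t in tokens)
        ham_hits = sum(self.ham[t] for t in tokens)
        # Ties (including "never seen any token") default to ham.
        return "spam" if spam_hits > ham_hits else "ham"

    def learn(self, text, label):
        target = self.spam if label == "spam" else self.ham
        target.update(text.lower().split())


def train_on_error(filt, corpus):
    """Learn a message only when the filter misclassifies it."""
    learned = 0
    for text, label in corpus:
        if filt.classify(text) != label:
            filt.learn(text, label)
            learned += 1
    return learned


# Deliberately unbalanced toy corpus: 50 spam, 5 ham.
corpus = [("cheap pills buy now", "spam")] * 50 + [("meeting notes attached", "ham")] * 5

filt = ToyFilter()
learned = train_on_error(filt, corpus)
print(f"learned {learned} of {len(corpus)} messages")
```

In this toy run only the first misclassified spam is learned; the remaining messages are already classified correctly and add no tokens, so the surplus of spam does not flood the token store the way training on every message would.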

Thanks in advance!



--
Kind Regards from Switzerland,

Stevan Bajić
