On 20.04.2012 07:32, Steve Fatula wrote:

    *From:* Stevan Bajić <ste...@bajic.ch>
    *To:* "dspam-user@lists.sourceforge.net"
    <dspam-user@lists.sourceforge.net>
    *Sent:* Thursday, April 19, 2012 5:49 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    Spam: 1.391985007296783157922995792876 %
    Ham: 98.608014992703216842077004207124 %

    This is crazy. You have about 70 times more Ham tokens in the
    database than Spam tokens. Your tokens are totally unbalanced.

But, that's a fact, right?
Yes.

It just means not that much spam, and, lots of hammy words, right?
Yes.

It does not indicate much wrong. I do classify every single spam.
It does not indicate any wrongdoing on your part. But given your statement that spam is not being caught well, and looking at the ratio of Spam to Ham tokens, this is a strong indication that the chosen training mode is not a good fit for you.
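
The "about 70 times" figure can be rechecked from the two percentages quoted above. This is only a small arithmetic sketch; the variable names are made up for illustration and nothing here calls DSPAM itself.

```python
# Token shares quoted in the thread (percent of all tokens in the database).
spam_pct = 1.391985007296783157922995792876
ham_pct = 98.608014992703216842077004207124

# How many Ham tokens exist for every single Spam token.
ham_per_spam = ham_pct / spam_pct
print(f"Ham tokens per Spam token: {ham_per_spam:.1f}")
```

The ratio comes out a little under 71, which matches the "about 70 times" estimate.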

            Why do you set the training buffer to 3? Why not 5
            (the default)? Or why not disable it?

        Oh, I don't remember any more. It was a while ago. At the
        time, I believe not a single message was ever classified as
        SPAM, so, was experimenting during the training period.
        This sounds strange to me. I mean the fact that not a single
        message was EVER classified as SPAM.

    Well, the question is when was the first spam detected, i.e.,
    after how many false negatives. That I do not recall, so, it may
    not be too strange. It's just why I had changed the setting, to
    see what impact it might have had.

    Did you run with OSB from the beginning, or did you have another
    tokenizer before?

From the very first email, osb. But I also report to (gasp) spamcop, and, it does seem to remove spammers many times, but not always of course. For a couple cases, I even went to the pain of reporting to the data center and they did take care of it. So, done pretty well on the controlling front. And, like another suggested who said they don't even use dspam much, we do have a lot of anti-spam technology on the postfix end.
I have a lot of anti-spam technology running before DSPAM. Usually my inbound Spam is somewhere between 3% and 6% (this is not only for me; it is the whole email flow for all the domains combined, including Spam honey pots). Some domains could easily live without DSPAM because all their Spam is already blocked before processing reaches DSPAM.

    Probably would be a good idea if it was handled as hashed tables.
    I don't see why everyone wouldn't want to ignore most of the
    headers. Most all of them are useless.

    As usual: People think the more data there is to process the
    better the engine can decide which class a message is.
    I read in the past a document where they explicitly mentioned that
    ignoring certain headers (like the date, message id, etc) would
    slightly (1 digit percentage) increase the accuracy.

I don't see how having useless data = better results (not you, "people"). I would think it would be more likely to give incorrect results, I am surprised by 1%.
Ach. This is a phenomenon that is hard to describe. I see it all the time. I remember, already decades ago when doing my first projects, conversations like this:
IT guy: What data do we need to store?
Customer: Just ABC.
IT guy: Okay. We are collecting ABC but we as well have XYZ.
Customer: Good. I just need ABC.
IT guy: We are going to store ABC and XYZ.
Customer: Why?
IT guy: Well.... we got that data when clients are using our application.
Customer: And?
IT guy: And why not store XYZ in the database as well? It's cheap to store and you might need XYZ one day.
Customer: Okay. I just need ABC, and as long as it does not affect speed or anything else, you can store XYZ.

IT people tend to stockpile data.


    Right. But judging from your low Spam tokens from above I am not
    sure how much that would be. Do you have a lot of Spam messages?
    If you want I can easily provide you with good-quality Spam corpora
    containing many thousands of Spam messages. For example: one Spam
    corpus creator that I often use to check how well my DSPAM is
    working has 414'731 Spam messages in his 2011 corpus, 38'807 for
    January 2012, 48'237 for February 2012, 49'178 for March 2012 and
    so far 20'705 for April 2012. Training with just his data from
    2011 and the available data from 2012 would give you over half a
    million spam mails to train with. Maybe I am wrong, but I would be
    surprised if you have that much data available to train with.
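
The "over half a million" claim can be verified by adding up the corpus sizes listed above. A minimal sketch, with the counts copied from the message and the dictionary keys chosen only for readability:

```python
# Spam message counts per corpus, as quoted in the thread.
corpus_counts = {
    "2011": 414_731,
    "2012-01": 38_807,
    "2012-02": 48_237,
    "2012-03": 49_178,
    "2012-04 (so far)": 20_705,
}

total = sum(corpus_counts.values())
print(f"Total spam messages available: {total}")
```

The sum is 571,658 messages, so the estimate holds with some margin.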

That would be the understatement of the century. I have all my spams going back to Jan 1 as it turns out. That would be all of 80 messages in my case, not quite adding up to hundreds of thousands, almost. ;-) Others on the system likely have more, but, I do not. Almost all of it is yahoo, hotmail spam.

I would *love* to have the corpus for 2011 and 2012. PLEASE, send them my way. That would likely benefit the others. The question of course is how. Can I download them from somewhere? Sounds too big to email. I can give you an FTP site if you do not have a place to put them.

I will send you a mail and you can download them yourself.

If you really want to benefit others with it then I strongly urge you to make a merged group. That will allow you to prepare everything in advance without disturbing anyone, and then when you are finished you just turn on the merged group and out of nowhere all your users will see an increase in accuracy.


    Or if you want you could send me your Ham corpus and I can train
    with my own training method(*), create a merged group for you,
    export the SQL data and send it to you.

If you give me the SPAM corpus, I can just run dspam_train on it (and I'd even add my 80). But it will be pretty unbalanced since I have few HAM messages, as I only keep a month (maybe a few thousand messages). I am not sure that matters much? In the end, won't the detection still work, maybe biased towards SPAM at first, but, surely, it wouldn't take too long to stop false positives?

You are right. If you run TOE then an unbalanced training corpus is not such a big problem, since DSPAM will only learn on error.
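
Why train-on-error tolerates an unbalanced corpus can be illustrated with a toy loop. This is not DSPAM's implementation: the `ToyFilter` class, its scoring rule, and the sample messages are all invented for illustration. The only point it tries to show is that under train-on-error a message changes the token counts only when the filter classified it wrongly.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a statistical filter; not DSPAM's algorithm."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        spam_hits = sum(self.counts["spam"][t] for t in tokens)
        ham_hits = sum(self.counts["ham"][t] for t in tokens)
        # Ties (including an empty database) fall back to "ham".
        return "spam" if spam_hits > ham_hits else "ham"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())


def train_on_error(filt, corpus):
    """Feed labelled messages; learn only from the misclassified ones."""
    learned = 0
    for text, label in corpus:
        if filt.classify(text) != label:
            filt.learn(text, label)
            learned += 1
    return learned


# Deliberately unbalanced toy corpus: many spam messages, few ham.
corpus = [("cheap pills online", "spam")] * 50 + [("meeting notes attached", "ham")] * 2

toy = ToyFilter()
learned = train_on_error(toy, corpus)
print(f"messages seen: {len(corpus)}, messages learned: {learned}")
```

In this toy run only the first spam message is misclassified and therefore learned; the remaining messages are already classified correctly and leave the counts untouched, so the 50-to-2 imbalance in the corpus does not carry over into the stored tokens.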

Thanks in advance!


--
Kind Regards from Switzerland,

Stevan Bajić
