On 22.04.2012 23:12, Steve Fatula wrote:

    *From:* Stevan Bajić <ste...@bajic.ch>
    *To:* "dspam-user@lists.sourceforge.net"
    <dspam-user@lists.sourceforge.net>
    *Sent:* Sunday, April 22, 2012 2:28 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    That is correct, as I had mentioned, I did not have a lot of HAM
    to train with.
    At least you have around 33K Ham messages. This is not that bad.

Well, not really. Remember, on your instructions, I essentially trained the same HAM many times, so the real count is that number divided by the number of months covered by the SPAM corpus, around 16 I think.
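As a rough back-of-the-envelope check of that division, here is a minimal sketch. It assumes the "around 33K" and "around 16" figures from this thread are close enough to use directly; the function name is made up for illustration.

```python
def unique_ham_estimate(trained_count: int, repeats: int) -> int:
    """Estimate distinct ham messages when each one was trained `repeats` times."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    return trained_count // repeats


if __name__ == "__main__":
    # Both inputs are the approximate numbers quoted in the thread.
    print(unique_ham_estimate(33_000, 16))
```

So the distinct ham corpus is on the order of two thousand messages, not 33K.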

Yes, I remember. Thank god you run TOE. With TEFT, that kind of training would have defeated the whole purpose of the training.


I have since redone and added more HAM that I dug up.

    I will build it up over time.
    Maybe you can get even more if you use the mails from the sent
    folder of all the users.

I got as much sent mail as I could; it's only retained for 30 days.
Okay. Then maybe you should go on and download other Ham corpora too?
TREC 2005: http://plg.uwaterloo.ca/~gvcormac/treccorpus/
TREC 2006: http://plg.uwaterloo.ca/~gvcormac/treccorpus06/
TREC 2007: http://plg.uwaterloo.ca/~gvcormac/treccorpus07/
SpamAssassin Public Corpus: http://spamassassin.apache.org/publiccorpus/

Most of them are old but still good enough for training just the Ham part of the corpora. Of course, no one is stopping you from training the full corpus with both Spam and Ham.


    Surprisingly, the first day went well.

    You mean the first day using the merged group from above? I told
    you that this approach would work. Tell me more. How long did it
    take you to train with that many messages? How long was the
    downtime? Was the production downtime as low as I told you?

Yes, using the merged group. Since I wanted a master machine with a Dspam database holding only the merged group training data, which I could then copy to other systems, I just used my handy Lion server at home. I loaded up Dspam and ran the training on it. I don't recall how long it took, maybe 8 hours? Production downtime was close to 0, since there wasn't much to do and I had a MySQL script already set up to run the commands. So, a few minutes maybe.

Perfect! I hope this will motivate and encourage others here to ditch their old TEFT-based data and switch to TOE too.


I mysqldump'd the dspam_stats table (all 1 user) and the token table (all 1 user) on Lion server since I wanted to copy those over to one of the real mail servers. So, loading it was trivial of course. And, since the training was done on a local machine, it was no big deal.
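For anyone wanting to repeat that export/import step, here is a minimal sketch that only builds the command lines and does not run them. The database name, the MySQL user and the dump file name are assumptions for illustration. The table names are assumed to be those of the stock DSPAM MySQL schema, so check them against your own database first.

```python
import shlex

# Assumed table names (stats table and token table); verify against your schema.
TABLES = ["dspam_stats", "dspam_token_data"]


def dump_command(database: str, tables: list[str], user: str) -> list[str]:
    """Build the mysqldump argument list for the given tables (-p prompts for a password)."""
    return ["mysqldump", "-u", user, "-p", database, *tables]


def load_command(database: str, user: str) -> list[str]:
    """Build the mysql argument list that reads a dump file from stdin."""
    return ["mysql", "-u", user, "-p", database]


if __name__ == "__main__":
    # "dspam" (database and user) and "merged_group.sql" are hypothetical names.
    print(shlex.join(dump_command("dspam", TABLES, "dspam")), "> merged_group.sql")
    print(shlex.join(load_command("dspam", "dspam")), "< merged_group.sql")
```

The first printed line would be run on the training machine and the second on the target mail server, after copying the dump file over.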

At the same time, I updated the Macports Portfile to compile the latest dspam. I'll have to check this in on Macports when I get a chance so others can use it.


    I ask because it would be good if your experience in switching
    from TEFT to TOE and using a merged group and that additional
    training and deleting your whole user data, etc.... could motivate
    others in following your example.

Well, it wasn't that bad really, as you can tell from my other responses above. Obviously, the only downside is not being able to retrain mail that was received shortly before the conversion. A very small price to pay.

Well... if one has a problem with that, then not truncating the signature table will still allow one to retrain recently received mails. But I really suggest not doing that. Have the courage to delete the old stuff and look forward to a new setup with better accuracy.


    Which was to be expected. TEFT is an evil relic from the past.

But it's the default! It makes no logical sense to me that the devs won't change the default! I am sure I get the reason, but I would tend to disagree with it.


I know, I know, I know. I am still waiting for Tom's blessing allowing me to change the default :)

--
Kind Regards from Switzerland,

Stevan Bajić
