On 22.04.2012 23:12, Steve Fatula wrote:
*From:* Stevan Bajić <ste...@bajic.ch>
*To:* "dspam-user@lists.sourceforge.net"
<dspam-user@lists.sourceforge.net>
*Sent:* Sunday, April 22, 2012 2:28 PM
*Subject:* Re: [Dspam-user] Increase Spam Hit Rate
That is correct, as I had mentioned, I did not have a lot of HAM
to train with.
At least you have around 33K Ham messages. This is not that bad.
Well, not really. Remember, on your instructions, I essentially
trained the HAM many times, so, it's that many divided by the number
of months from the SPAM corpus, around 16 I think.
Yes, I remember. Thank god you run TOE. With TEFT that kind of training
would have defeated the whole purpose of the training.
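
To illustrate the point about repeated training, here is a minimal toy sketch in Python. It is not DSPAM's actual algorithm: the ToyFilter class, its simple token counting, and the sample tokens are all hypothetical. It only shows the general difference between the two modes: TEFT (train everything) learns from every message it is given, so feeding the same Ham 16 times inflates its token counts 16-fold, while TOE (train on error) learns only when the filter gets the message wrong.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical token-count filter, for illustration only (not DSPAM)."""

    def __init__(self, mode):
        self.mode = mode          # "TEFT" or "TOE"
        self.ham = Counter()      # token -> number of ham trainings
        self.spam = Counter()     # token -> number of spam trainings

    def classify(self, tokens):
        ham_score = sum(self.ham[t] for t in tokens)
        spam_score = sum(self.spam[t] for t in tokens)
        if ham_score == spam_score:
            return "unsure"       # a tie counts as an error for TOE
        return "ham" if ham_score > spam_score else "spam"

    def train(self, tokens, label):
        # TOE skips messages the filter already classifies correctly.
        if self.mode == "TOE" and self.classify(tokens) == label:
            return
        target = self.ham if label == "ham" else self.spam
        target.update(tokens)


def repeat_training(mode, times):
    """Train the same ham message `times` times; return one token's count."""
    f = ToyFilter(mode)
    message = ["meeting", "agenda", "monday"]  # made-up sample tokens
    for _ in range(times):
        f.train(message, "ham")
    return f.ham["meeting"]


if __name__ == "__main__":
    print("TEFT:", repeat_training("TEFT", 16))
    print("TOE: ", repeat_training("TOE", 16))
```

In this toy model the TEFT count grows with every repetition, whereas TOE stops learning the message once it is classified correctly.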
I have since redone and added more HAM that I dug up.
I will build it up over time.
Maybe you can get even more if you use the mails from the sent folder
of all the users.
I got as much sent as I could, it's only retained for 30 days.
Okay. Then maybe you should go on and download other Ham corpora too?
TREC 2005: http://plg.uwaterloo.ca/~gvcormac/treccorpus/
TREC 2006: http://plg.uwaterloo.ca/~gvcormac/treccorpus06/
TREC 2007: http://plg.uwaterloo.ca/~gvcormac/treccorpus07/
SpamAssassin Public Corpus: http://spamassassin.apache.org/publiccorpus/
Most of them are old but still good enough for training just the Ham
part of the corpora. Of course no one is stopping you from training the
full corpus with Spam and Ham.
Surprisingly, the first day went well.
You mean the first day using the merged group from above? I told
you that this approach would work. Tell me more. How long did it
take you to train with that many messages? How long was the
downtime? Was the production downtime as low as I told you?
Yes, using the merged group. I wanted a master machine with a
Dspam database holding only the merged group training data, which I
could then copy to other systems, so I just used my handy Lion server
at home. Loaded up Dspam and ran the training on it. I don't recall how
long, maybe 8 hours? Production downtime was close to 0 since there
wasn't much to do and I had a MySQL script already set up to do the
commands. So, a few minutes maybe.
Perfect! I hope this will motivate and encourage others here to ditch
their old TEFT based data and switch to TOE too.
I mysqldump'd the dspam_stats table (all 1 user) and the token table
(all 1 user) on Lion server since I wanted to copy those over to one
of the real mail servers. So, loading it was trivial of course. And,
since the training was done on a local machine, it was no big deal.
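
For anyone wanting to repeat the copy step, here is a minimal sketch in Python that only builds the mysqldump command line for the two tables mentioned above. The database name dspam and the table name dspam_token_data are assumptions based on the stock MySQL schema; the thread itself only says "the token table". Credentials are assumed to come from the usual MySQL client configuration. Check the names against your own setup before running anything.

```python
import shlex


def build_dump_command(database="dspam",
                       tables=("dspam_stats", "dspam_token_data")):
    """Return the argv for dumping the given tables with mysqldump.

    The database and table names are assumptions, not taken from the thread.
    """
    # mysqldump syntax: mysqldump [options] db_name [tbl_name ...]
    return ["mysqldump", database, *tables]


if __name__ == "__main__":
    # Print the command so it can be reviewed and redirected to a file,
    # e.g. by appending "> merged_group.sql" in the shell.
    print(shlex.join(build_dump_command()))
```

The resulting SQL file can then be loaded on the target mail server with the mysql client.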
At the same time, I updated the Macports Portfile to compile the
latest dspam. I'll have to check this in on Macports when I get a
chance so others can use it.
I ask because it would be good if your experience in switching
from TEFT to TOE, using a merged group, doing that additional
training, deleting your whole user data, etc. could motivate
others to follow your example.
Well, it wasn't that bad really. As you can tell from my other
responses above. Obviously, the only downside is not being able to
retrain recently received mail that was received before the
conversion. Very small price to pay.
Well... if one has a problem with that, then not truncating the
signature table will still allow one to retrain recently received mails.
But I really suggest not doing that. Have the courage to delete the old
stuff and look forward to a new setup with better accuracy.
Which was to be expected. TEFT is an evil relic from the past.
But it's the default! It makes no logical sense to me that the devs
won't change the default! I am sure I get the reason, but I would
tend to disagree with it.
I know, I know, I know. I am still waiting for Tom's blessing allowing
me to change the default :)
--
Kind Regards from Switzerland,
Stevan Bajić