On 20.04.2012 07:32, Steve Fatula wrote:
*From:* Stevan Bajić <ste...@bajic.ch>
*To:* "dspam-user@lists.sourceforge.net"
<dspam-user@lists.sourceforge.net>
*Sent:* Thursday, April 19, 2012 5:49 PM
*Subject:* Re: [Dspam-user] Increase Spam Hit Rate
Spam: 1.391985007296783157922995792876 %
Ham: 98.608014992703216842077004207124 %
This is crazy. You have about 70 times more Ham tokens in the
database than Spam tokens. Your tokens are totally unbalanced.
But, that's a fact, right?
Yes.
It just means not that much spam, and, lots of hammy words, right?
Yes.
It does not indicate much wrong. I do classify every single spam.
It does not indicate any wrongdoing on your part. But considering
your statement that spam is not well captured, and looking at the ratio
of Spam to Ham tokens, this is a strong indication that the chosen
training mode is not a good fit for you.
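As a quick check, the "about 70 times" figure follows directly from the
two percentages quoted above. A minimal sketch (the variable names are
made up for illustration):

```python
from decimal import Decimal

# Token shares exactly as quoted in the thread (percent of all tokens).
spam_pct = Decimal("1.391985007296783157922995792876")
ham_pct = Decimal("98.608014992703216842077004207124")

# How many Ham tokens exist for every Spam token.
ham_to_spam = ham_pct / spam_pct
print(f"Ham/Spam token ratio: {ham_to_spam:.1f}")
```

This only restates the imbalance as a single number; it says nothing by
itself about which training mode would correct it.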
Why did you set the training buffer to 3? Why not 5
(the default)? Or why not disable it?
Oh, I don't remember any more. It was a while ago. At the
time, I believe not a single message was ever classified as
SPAM, so, was experimenting during the training period.
This sounds strange to me. I mean the fact that not a single
message was EVER classified as SPAM.
Well, the question is when the first spam was detected, i.e.,
after how many false negatives. That I do not recall, so, it may
not be too strange. It's just why I had changed the setting, to
see what impact it might have had.
Did you run with OSB from the beginning, or did you use another
tokenizer before?
From the very first email, osb. But I also report to (gasp) spamcop,
and, it does seem to remove spammers many times, but not always of
course. For a couple cases, I even went to the pain of reporting to
the data center and they did take care of it. So, done pretty well on
the controlling front. And, like another suggested who said they don't
even use dspam much, we do have a lot of anti-spam technology on the
postfix end.
I have a lot of anti-spam technology running before DSPAM. Usually my
inbound Spam is somewhere between 3% and 6% (this is not only for me;
it is the whole email flow for all the domains combined, including
Spam honey pots). Some domains could easily live without DSPAM because
all their Spam is already blocked before processing reaches DSPAM.
Probably would be a good idea if it was handled as hashed tables.
I don't see why everyone wouldn't want to ignore most of the
headers. Most all of them are useless.
As usual: People think the more data there is to process, the
better the engine can decide which class a message belongs to.
In the past I read a document where they explicitly mentioned that
ignoring certain headers (like the date, message id, etc.) would
slightly (by a single-digit percentage) increase the accuracy.
I don't see how having useless data = better results (not you,
"people"). I would think it would be more likely to give incorrect
results. I am surprised by 1%.
Ach. This is a phenomenon that is hard to describe. I see it all the
time. I remember, already decades ago, when doing my first projects,
things like this:
IT guy: What data do we need to store?
Customer: Just ABC.
IT guy: Okay. We are collecting ABC but we as well have XYZ.
Customer: Good. I just need ABC.
IT guy: We are going to store ABC and XYZ.
Customer: Why?
IT guy: Well.... we got that data when clients are using our application.
Customer: And?
IT guy: And why not store XYZ as well in the database? It's cheap to
store and you might one day need XYZ.
Customer: Okay. I just need ABC and as long as it does not disturb speed
or anything else, you can store XYZ.
IT people tend to stockpile data.
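The header-ignoring idea mentioned above can be sketched outside of DSPAM
like this. It is only an illustration using Python's standard email
module: the function name and the set of ignored headers are assumptions
made for the example, and this is not how DSPAM itself handles headers.

```python
from email import message_from_string

# Assumption: headers treated as noise, taken from the examples named in
# the thread (date, message id). Adjust to taste.
IGNORED_HEADERS = {"date", "message-id"}


def strip_ignored_headers(raw_message, ignored=IGNORED_HEADERS):
    """Return the message text with the ignored headers removed."""
    msg = message_from_string(raw_message)
    for name in list(msg.keys()):
        if name.lower() in ignored:
            # Deletes every occurrence of this header name.
            del msg[name]
    return msg.as_string()


raw = (
    "From: someone@example.com\n"
    "Date: Fri, 20 Apr 2012 07:32:00 +0200\n"
    "Message-ID: <abc123@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
cleaned = strip_ignored_headers(raw)
print(cleaned)
```

Whatever tokenizer runs afterwards then never sees the removed headers,
so they cannot add tokens to either class.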
Right. But judging from your low Spam token count from above I am not
sure how much that would be. Do you have a lot of Spam messages?
If you want I can easily provide you with good quality Spam corpora
with many (thousands) of Spam messages. For example: One Spam corpus
creator that I often use to check how well my DSPAM is working has
414'731 Spam messages in his 2011 corpus, 38'807 for January 2012,
48'237 for February 2012, 49'178 for March 2012 and so far 20'705
for April 2012. Training with his data from 2011 and the available
data from 2012 alone would give you over half a million spam mails
to train with. Maybe I am wrong but I would be surprised if you
have that much data available to train with.
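Adding up the corpus sizes quoted above confirms the "over half a
million" figure. A small sketch (the dictionary keys are just labels
chosen for this example):

```python
# Spam message counts per corpus, as quoted in the thread.
corpus_sizes = {
    "2011": 414_731,
    "2012-01": 38_807,
    "2012-02": 48_237,
    "2012-03": 49_178,
    "2012-04 (so far)": 20_705,
}

total = sum(corpus_sizes.values())
print(f"Total Spam messages available: {total}")  # 571658
```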
That would be the understatement of the century. I have all my spams
going back to Jan 1 as it turns out. That would be all of 80 messages
in my case, not quite adding up to hundreds of thousands, almost. ;-)
Others on the system likely have more, but, I do not. Almost all of it
is yahoo, hotmail spam.
I would *love* to have the corpus for 2011 and 2012. PLEASE, send them
my way. That would likely benefit the others. The question of course
is how. Can I download them from somewhere? Sounds too big to email. I
can give you a FTP site if you do not have a place to put them.
I will send you a mail and you can download them yourself.
If you really want to benefit others with it then I strongly urge you to
make a merged group. That will allow you to prepare stuff in advance
without disturbing anyone, and then when you are finished you just turn
on the merged group and out of nowhere all your users will have an
increase in accuracy.
Or if you want you could send me your Ham corpus and I can train
with my own training method(*), create a merged group for you,
export the SQL data and send it to you.
If you give me the SPAM corpus, I can just run dspam_train on it (and
I'd even add my 80). But it will be pretty unbalanced since I have few
HAM messages since I only keep a month (maybe a few thousand
messages). I am not sure that matters much? In the end, won't the
detection still work, maybe biased towards SPAM at first, but, surely,
it wouldn't take too long to stop false positives?
You are right. If you run TOE then an unbalanced training corpus is not
such a big problem, since DSPAM will only learn on error.
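To illustrate why learning only on error softens the effect of an
unbalanced corpus, here is a toy sketch. The word-count classifier, the
class and function names and the sample messages are all invented for
this example; it is not DSPAM's algorithm, only the train-on-error loop
in miniature.

```python
from collections import Counter


class ToyTokenFilter:
    """Toy word-count filter, purely illustrative (not DSPAM)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        spam_score = sum(self.counts["spam"][t] for t in tokens)
        ham_score = sum(self.counts["ham"][t] for t in tokens)
        # Ties go to ham, so an untrained filter lets everything through.
        return "spam" if spam_score > ham_score else "ham"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())


def train_on_error(flt, corpus):
    """Walk (text, label) pairs and learn only the misclassified ones."""
    learned = 0
    for text, label in corpus:
        if flt.classify(text) != label:
            flt.learn(text, label)
            learned += 1
    return learned


# Heavily unbalanced corpus: 1000 spam messages, 10 ham messages.
corpus = [("cheap pills online", "spam")] * 1000
corpus += [("meeting notes attached", "ham")] * 10

flt = ToyTokenFilter()
learned = train_on_error(flt, corpus)
print(f"Messages actually learned: {learned} of {len(corpus)}")
```

In this toy run only the first spam message is learned; everything after
it is already classified correctly, so the remaining 999 spam messages
add no tokens. A train-everything mode would have stored all of them.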
Thanks in advance!
--
Kind Regards from Switzerland,
Stevan Bajić
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user