From: Stevan Bajić <ste...@bajic.ch>
>To: "dspam-user@lists.sourceforge.net" <dspam-user@lists.sourceforge.net>
>Sent: Thursday, April 19, 2012 5:49 PM
>Subject: Re: [Dspam-user] Increase Spam Hit Rate
>
>
>Spam: 1.391985007296783157922995792876 %
>Ham: 98.608014992703216842077004207124 %
>
>This is crazy. You have about 70 times more Ham tokens in the
database than Spam tokens. Your tokens are totally unbalanced.
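
The "about 70 times" figure can be checked from the two percentages quoted above. This is only a quick sketch: the variable names are made up for illustration, and it assumes the percentages are token shares of the same database.

```python
from decimal import Decimal

# Token shares copied from the statistics quoted above (percent of all tokens).
spam_pct = Decimal("1.391985007296783157922995792876")
ham_pct = Decimal("98.608014992703216842077004207124")

# How many Ham tokens exist for every Spam token.
ratio = ham_pct / spam_pct
print(f"Ham tokens per Spam token: {ratio:.1f}")
```

The result lands between 70 and 72, which is consistent with the estimate.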
>
>But that's a fact, right? It just means there is not much spam and there are
>lots of hammy words, right? It does not indicate that much is wrong. I do
>classify every single spam.
Why do you set the training buffer to 3? Why not 5 (the default)? Or why not
disable it?
>>>>>
>>>>>
Oh, I don't remember any more. It was a while ago. At the time, I believe not a
single message was ever classified as SPAM, so I was experimenting during the
training period.
This sounds strange to me. I mean the fact that not a single message was EVER
classified as SPAM.
>>>
>>>
Well, the question is when the first spam was detected, i.e., after how many
false negatives. That I do not recall, so it may not be too strange. It's just
why I had changed the setting, to see what impact it might have.
>>
>>
Did you run from the beginning with OSB or had you another tokenizer before?
>
>From the very first email, OSB. But I also report to (gasp) spamcop, and, it
>does seem to remove spammers many times, but not always of course. For a
>couple cases, I even went to the pain of reporting to the data center and they
>did take care of it. So, done pretty well on the controlling front. And, like
>another suggested who said they don't even use dspam much, we do have a lot of
>anti-spam technology on the postfix end.
It would probably be a good idea if it was handled as hashed tables. I don't see
why everyone wouldn't want to ignore most of the headers. Almost all of them are
useless.
>>
>>
As usual: People think the more data there is to process, the better the engine
can decide which class a message belongs to.
>I read in the past a document where they explicitly mentioned that
ignoring certain headers (like the date, message id, etc) would
slightly (1 digit percentage) increase the accuracy.
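
To illustrate the idea of ignoring certain headers, here is a rough sketch that drops them from a message before it would be handed to a tokenizer. It is not how DSPAM does it internally; the function name and the list of ignored headers are assumptions made up for this example, and only Python's standard email module is used.

```python
from email import message_from_string

# Hypothetical list of headers to ignore, following the examples named above.
IGNORED_HEADERS = ("Date", "Message-ID")

def strip_ignored_headers(raw_message, ignored=IGNORED_HEADERS):
    """Return the message text with the ignored headers removed."""
    msg = message_from_string(raw_message)
    for name in ignored:
        # Deleting a header that is not present is a no-op.
        del msg[name]
    return msg.as_string()

raw = (
    "Date: Thu, 19 Apr 2012 17:49:00 +0200\n"
    "Message-ID: <example@invalid>\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
cleaned = strip_ignored_headers(raw)
print(cleaned)
```

Header names are matched case-insensitively, so "message-id" would be removed as well.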
>
>I don't see how having useless data = better results (not you, "people"). I
>would think it would be more likely to give incorrect results. I am surprised
>by 1%.
>>
Right. But judging from your low Spam token count above, I am not sure how much
that would be. Do you have a lot of Spam messages? If you want, I can easily
provide you with good-quality Spam corpora with many (thousands of) Spam
messages. For example: One Spam corpus creator that I often use to check how
well my DSPAM is working has 414'731 Spam messages in his 2011 corpus, 38'807
for January 2012, 48'237 for February 2012, 49'178 for March 2012 and so far
20'705 for April 2012. Training with just his data from 2011 and the available
data from 2012 would give you over half a million spam mails to train with.
Maybe I am wrong, but I would be surprised if you have that much data available
to train with.
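
As a quick check of the "over half a million" figure, the counts given above can simply be added up. The dictionary keys are only labels chosen for this sketch.

```python
# Spam message counts per corpus, as quoted above.
spam_counts = {
    "2011": 414_731,
    "2012-01": 38_807,
    "2012-02": 48_237,
    "2012-03": 49_178,
    "2012-04 (so far)": 20_705,
}

total = sum(spam_counts.values())
print(f"Total Spam messages: {total}")  # 571658
```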
>
>That would be the understatement of the century. I have all my spams going
>back to Jan 1 as it turns out. That would be all of 80 messages in my case,
>not quite adding up to hundreds of thousands, almost. ;-) Others on the system
>likely have more, but, I do not. Almost all of it is yahoo, hotmail spam.
I would *love* to have the corpus for 2011 and 2012. PLEASE, send them my way.
That would likely benefit the others. The question of course is how. Can I
download them from somewhere? Sounds too big to email. I can give you a FTP
site if you do not have a place to put them.
>Or if you want, you could send me your Ham corpus and I can train
with my own training method(*), create a merged group for you,
export the SQL data and send it to you.
>
>If you give me the SPAM corpus, I can just run dspam_train on it (and I'd even
>add my 80). But it will be pretty unbalanced, since I have few HAM messages
>because I only keep a month (maybe a few thousand messages). I am not sure that
>matters much? In the end, won't the detection still work, maybe biased towards
>SPAM at first, but surely it wouldn't take too long to stop false positives?
Thanks in advance!
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user