On 19.04.2012 00:54, Steve Fatula wrote:
*From:* Stevan Bajić <ste...@bajic.ch>
*To:* dspam-user@lists.sourceforge.net
*Sent:* Wednesday, April 18, 2012 3:41 PM
*Subject:* Re: [Dspam-user] Increase Spam Hit Rate
This is not good. But the above data is not that horrible.
Anyway... allow me to ask you a bunch of questions:
1) When you get a FN or a FP and then you retrain and later you
get almost the same message again, does DSPAM classify it
correctly? (aka: do you have the feeling DSPAM is learning quickly
or rather slowly)
If I get the exact same message, it sometimes still shows up as not
spam, but mostly it shows up as spam the next time. So, my answer is
that it learns a specific message quickly.
Okay. This indicates then that you have a high imbalance in your spam/ham
ratio. Can you post the result of the following SQL query:
select sum(spam_hits),sum(innocent_hits) from dspam_token_data where
uid=<insert_here_your_own_uid>;
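As a rough sketch of what that query measures, the Python below builds a toy dspam_token_data table in an in-memory SQLite database and runs the same aggregate. SQLite only stands in for MySQL here, and the sample rows and the uid 1 are invented for illustration. If one sum dwarfs the other, that would point to the imbalance mentioned above.

```python
import sqlite3

def hit_totals(conn, uid):
    """Return (spam_hits_total, innocent_hits_total) for one uid."""
    row = conn.execute(
        "SELECT SUM(spam_hits), SUM(innocent_hits) "
        "FROM dspam_token_data WHERE uid = ?",
        (uid,),
    ).fetchone()
    # SUM() yields NULL (None) when the uid has no rows; report 0 instead.
    return (row[0] or 0, row[1] or 0)

# Toy stand-in for the real table; only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dspam_token_data ("
    "uid INTEGER, token INTEGER, spam_hits INTEGER, innocent_hits INTEGER)"
)
conn.executemany(
    "INSERT INTO dspam_token_data VALUES (?, ?, ?, ?)",
    [(1, 101, 2, 40), (1, 102, 1, 55), (2, 101, 9, 9)],  # invented sample rows
)

spam_total, innocent_total = hit_totals(conn, 1)
print(spam_total, innocent_total)
```

Against a real installation the same SELECT would be run in the mysql client with your own uid substituted.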
2) Those ~ 6'000 processed messages from above are from an account
that is how many days/months/years old?
Well, I will assume this means while using dspam, obviously. I don't
actually recall. I would say possibly a year as a wild guess.
Okay.
#
# OnFail: What to do if local delivery or quarantine should fail. If set
# to "unlearn", DSPAM will unlearn the message prior to exiting with an
# unsuccessful return code. The default option, "error" will not unlearn
# the message but return the appropriate error code. The unlearn option
# is useful on some systems where local delivery failures will cause the
# message to be requeued for delivery, and could result in the message
# being processed multiple times. During a very large failure, however,
# this could cause a significant load increase.
#
OnFail unlearn
I would not unlearn the message on failures. Do you have any
reason to set this to 'unlearn'?
The reason would be the reason given in the comments. i.e., it would
be re-queued.
Aha. You got that wrong. The note says that it could be useful to set it
to unlearn IF a local delivery failure is resulting in a requeue of the
message. Is this the case for you? Does a failure in the local delivery
(in your case a delivery to the dovecot LMTP service) result in a
re-delivery/re-queue of the exact same message?
#
# Training Mode: The default training mode to use for all operations, when
# one has not been specified on the commandline or in the user's preferences.
# Acceptable values are:
# toe Train on Error (Only)
# teft Train Everything (Trains on every message)
# tum Train Until Mature (Train only tokens without enough data)
# notrain Do not train or store signatures (large ISP systems, post-train)
#
TrainingMode teft
OUCH! I really, really, really would use TOE here. TEFT is so 'old
school' and really does more harm than good.
That's enough really's for me!
LOL. Sorry. I tried to put more weight on my answer.
#
# Features: Specify features to activate by default; can also be specified
# on the commandline. See the documentation for a list of available features.
# If _any_ features are specified on the commandline, these are ignored.
#
#Feature noise
Feature whitelist
I strongly advise you to enable 'noise' too.
Ok, I don't really seem to get the sort of spam it was meant for, but,
wouldn't hurt.
I don't understand this answer. Can you rephrase it?
# Training Buffer: The training buffer waters down statistics during training.
# It is designed to prevent false positives, but can also dramatically reduce
# dspam's catch rate during initial training. This can be a number from 0
# (no buffering) to 10 (maximum buffering). If you are paranoid about false
# positives, you should probably enable this option.
#
Feature tb=3
Why do you set the training buffer to 3? Why not 5 (the
default)? Or why not disable it?
Oh, I don't remember any more. It was a while ago. At the time, I
believe not a single message was ever classified as SPAM, so, was
experimenting during the training period.
This sounds strange to me. I mean the fact that not a single message was
EVER classified as SPAM.
#
# Preferences: Specify any preferences to set by default, unless otherwise
# overridden by the user (see next section) or a default.prefs file.
# If user or default.prefs are found, the user's preferences will override any
# defaults.
#
Preference "trainingMode=TEFT" # { TOE | TUM | TEFT | NOTRAIN } -> default:teft
Bad (IMHO). Set this to TOE.
Ok
Preference "enableBNR=off" # { on | off } -> default:off
I would enable BNR. This helps a lot.
Not a problem
# If you're running DSPAM in client/server (daemon) mode, uncomment the
# setting below to override the default connection cache size (the number
# of connections the server pools between all clients). The connection cache
# represents the maximum number of database connections *available* and should
# be set based on the maximum number of concurrent connections you're likely
# to have. Each connection may be used by only one thread at a time, so all
# other threads _will block_ until another connection becomes available.
#
MySQLConnectionCache25
There is a space missing before the '25'.
No, it's there in the file, it was a tab character I think
You were right. On my screen it was so close that I did not see the
space. But now I see it.
Ohhh boy! Where is that list from? It looks like one of my older
IgnoreHeader lists.
It's mostly someone's list, don't recall whose. It seemed like a very
good idea when I tracked what dspam was doing with various messages,
wasting time on headers with useless data in them. Is that not a good
thing to ignore?
It is! Ignoring headers is good for accuracy but bad for speed (DSPAM
needs to compare every header against the list, always processing the
whole ignore header list. We could make the code faster by using
hash tables, but the C code today does not do that).
I also have added a few headers on my own to messages, such as the geo
id and a few others I already have when pre-processing the message
anyway. I presumed this would help with countries that seem to be
mostly spammy.
Okay. In short: The config is okay. I would mainly move away from
TEFT. It is pure evil. While it might deliver results quickly
in the beginning, it will bite you in the future, and more so the
older the data in the storage backend gets. TOE is way better for you.
My advice would be (the order is important):
* Switch to TOE
* Enable 'noise' and 'BNR'
* Create a new user (let's call that user SpamHitRate)
* Disable whitelisting and other mumbo jumbo for that user
* Train the user with dspam_train
* Remove ALL TOKENS and STATISTICS for ALL USERS except for the
user SpamHitRate
* Use the user SpamHitRate as a global merged group
Tell all users on your system that you have fine-tuned the anti-spam
system, that they should expect the filter to make errors, and
that you expect them to correct those errors by training.
The good thing is that if they don't train / retrain the
system, they will not destroy their accuracy as fast with TOE as
they do with TEFT. The other good thing is that you can take all
the time in the world to train that 'SpamHitRate' user and do
whatever is needed to get a good catch rate for that user, then
convert it to a merged global group and let all your users
profit from that training instantly. On one hand this will drastically
reduce downtime of the anti-spam filter, and on the other hand it
will increase the catch rate instantly.
Ok, so, here's where it's unfamiliar to me. I have never used
dspam_train (user re-training works via dovecot antispam plugin). The
doc implies you want to use a corpus for this, I have none whose purpose
was that. Certainly, I don't have all those messages that passed
through dspam the first time. Probably, 90% of them are gone. I delete
things.
So, is the thought here to take all messages I currently have in my
inbox, trash, etc., that I know are not spam and, run them through
dspam_train as nonspam, and, then take the one folder that is spam,
and run them through dspam_train as spam, and simply go from there? (I
understand the merged group portion I think, have never used one yet
though). So, I could do this for all users who participated in
training so that the one SpamHitRate user could benefit from all of those.
Spam corpora are very easy to get. Ham corpora are a problem. You can
find ham corpora on the net, but usually it is better to use your own.
What you could do is use the messages you (and your users) have sent.
Don't use the inbox because those messages have the X-DSPAM-... headers
and you would need to clean them. Better to use the ones you find in the
sent folder.
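If inbox messages have to be used anyway, the cleaning step could look roughly like the Python sketch below. It is an illustration only, not a DSPAM tool: the function name strip_dspam_headers and the sample message are made up, and it simply drops every header whose name starts with X-DSPAM-.

```python
from email import message_from_string

def strip_dspam_headers(raw_message):
    """Return the message text with every X-DSPAM-* header removed."""
    msg = message_from_string(raw_message)
    for name in list(msg.keys()):
        if name.lower().startswith("x-dspam-"):
            del msg[name]  # removes all occurrences of that header
    return msg.as_string()

# Invented sample message for illustration.
sample = (
    "From: alice@example.com\n"
    "Subject: lunch\n"
    "X-DSPAM-Result: Innocent\n"
    "X-DSPAM-Confidence: 0.9\n"
    "\n"
    "See you at noon.\n"
)
cleaned = strip_dspam_headers(sample)
print(cleaned)
```

Messages cleaned this way could then be written out to a directory and used as a ham corpus.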
As far as removing all tokens and stats for all but one user, there is
no utility for this is there? Or, do I simply craft a bunch of MySQL
statements?
No tool. Simply craft a SQL command.
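A minimal sketch of how such statements might be generated is below. The table name dspam_stats is an assumption taken from the stock MySQL schema (only dspam_token_data appears in this thread), the helper purge_statements is hypothetical, and uid 42 is a placeholder for the SpamHitRate user's real uid. Check the table names against your own schema and take a backup before running anything like this.

```python
def purge_statements(keep_uid, tables=("dspam_token_data", "dspam_stats")):
    """Build DELETE statements that keep only rows belonging to keep_uid."""
    keep_uid = int(keep_uid)  # raises ValueError on non-numeric input
    return [f"DELETE FROM {table} WHERE uid <> {keep_uid};" for table in tables]

# 42 is a placeholder uid, not a value from this thread.
for stmt in purge_statements(42):
    print(stmt)
```

The printed statements can be reviewed by hand and then pasted into the mysql client.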
If you want, I can help you get to where you want to be
by sending you more info on how to set up that new system. You really
do need to delete your old data, however. It looks like your old
data is not good enough. And we have made DSPAM so much better
that erasing token data and starting from scratch produces
very good results very fast. In the past you had to wait
days/weeks until DSPAM caught up, but today this is no longer
the case.
Well, if merged groups are documented somewhere pretty well, I am sure
I can figure it out. If you have something specifically, feel free to
send it my way.
I don't know if anyone has added something to the wiki about group
support. From the past I know that people often complain about the
documentation. I personally find things to be well documented (not
excellently or superbly, but well). But I am me, and if everyone out
there were happy with the documentation then we would not have so many
complaints about it. So I guess it is not that well documented.
Before I start. Have you read about group support in DSPAM? If not then
read this here ->
http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=blob;f=README;hb=HEAD#l1363
--
Kind Regards from Switzerland,
Stevan Bajić