Re: [dspam-users] dspam accuracy

Tony Earnshaw Sat, 03 Mar 2007 19:21:06 -0800

Shane Kumpf wrote, on 03. mar 2007 19:06:

One of my POP accounts, which I pull down via fetchmail and which isn’t controlled by me, just changed their spam filtering software. In the process of this change they went from bouncing a lot of suspected spam to now just quarantining it. Since I don’t use their quarantine, obviously I’m ending up with quite a bit more spam than I used to.

FWIW I'm in more or less the same position as you, doing more or less the same. I'm responsible for a high school email system (1150+ nominal users, around 350 active mailers) running dspam as a daemon with remarkable accuracy. I had my private mail address on the school's server.

For my home machine (Red Hat RHAS4) in November/December last I decided to take my private mail off the school's server and activate my ISP's POP account. The ISP has a spam and virus filtering service but I don't trust it, I trust myself.

I'm running more or less the same basic system that I have at school, but I use Fetchmail 6.3.6. I have a Postfix 2.3.6 MTA calling amavisd-new 2.4.5 with ClamAV 0.90.1 and BitDefender-Console-Antivirus 7.3-1. A Postfix smtpd listener passes the mail to dspam CVS/MySQL 4.1.20, which scans it and passes it back to Postfix, which gives it to maildrop for IMAP distribution.

Unfortunately I'm not getting much spam. I do what I can to aggravate things to get more, like posting on newsgroups (which used to work well) with a throwaway address, but that mostly gets me a few "virus" (also phishing stuff caught by ClamAV).

Dspam is having a lot of trouble classifying these new messages as spam. The strange thing is that all this spam looks very similar to the spam it is catching. I’m starting to wonder if I should wipe my databases and start fresh; it’s been about a month and it doesn’t seem to be getting any better. I’m getting roughly 200 pieces of spam a day now. My stats have dropped considerably, from about 90% accuracy to less than 70%, as you will see. Do you think that if I continue to train it will get better, or that due to the size and age of my database this new spam will have trouble getting classified? If there is any info I can provide, let me know.

I decided to start with a completely empty dspam db and see what happened and I must say I'm pleased with the result up to now, dspam is learning relatively fast and beginning to judge sensibly, even to the extent that it's interpolating correctly (e.g. if it's had spam in Greek or French it recognizes spam in Spanish but leaves the local Dutch stuff alone - I haven't had any Dutch spam to date, though).

                TP True Positives:          17731
                TN True Negatives:          21733
                FP False Positives:         10937
                FN False Negatives:          6361
                SC Spam Corpusfed:           3741
                NC Nonspam Corpusfed:           1
                TL Training Left:               0
                SHR Spam Hit Rate          73.60%
                HSR Ham Strike Rate:       33.48%
                OCA Overall Accuracy:      69.53%


Mine started all askew but as of now it's:


                TP True Positives:             45
                TN True Negatives:           5940
                FP False Positives:             0
                FN False Negatives:            53
                SC Spam Corpusfed:              1
                NC Nonspam Corpusfed:           0
                TL Training Left:               0
                SHR Spam Hit Rate          45.92%
                HSR Ham Strike Rate:        0.00%
                OCA Overall Accuracy:      99.12%

At school it's:
                TP True Positives:          12914
                TN True Negatives:          87136
                FP False Positives:           384
                FN False Negatives:           344
                SC Spam Corpusfed:           3311
                NC Nonspam Corpusfed:        3002
                TL Training Left:               0
                SHR Spam Hit Rate          97.41%
                HSR Ham Strike Rate:        0.44%
                OCA Overall Accuracy:      99.28%

So not a wild difference between corpus feeding or not. The school gets most correspondence in Dutch and to begin with (starting October last) dspam thought all Dutch stuff was spam (all the corpus was English) and got mixed up, but it's mostly judging well now.
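
For anyone wanting to check the percentages in these stats blocks, here is a minimal Python sketch. The formulas are inferred from the figures quoted in this thread (they reproduce all three blocks), not taken from the dspam source, and the function name `dspam_rates` is made up for illustration.

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (SHR, HSR, OCA) as percentages from the four raw counters.

    Assumed formulas, inferred from the numbers posted in this thread:
      SHR = TP / (TP + FN)   share of all spam that was caught
      HSR = FP / (TN + FP)   share of all ham that was wrongly flagged
      OCA = (TP + TN) / all  share of all messages judged correctly
    """
    spam_total = tp + fn
    ham_total = tn + fp
    total = spam_total + ham_total
    shr = 100.0 * tp / spam_total if spam_total else 0.0
    hsr = 100.0 * fp / ham_total if ham_total else 0.0
    oca = 100.0 * (tp + tn) / total if total else 0.0
    return round(shr, 2), round(hsr, 2), round(oca, 2)


# Counters copied from the home and school blocks above.
home = dspam_rates(tp=45, tn=5940, fp=0, fn=53)
school = dspam_rates(tp=12914, tn=87136, fp=384, fn=344)

print(home)    # (45.92, 0.0, 99.12)
print(school)  # (97.41, 0.44, 99.28)
```

Note how the home figures show 99.12% overall accuracy alongside a 45.92% spam hit rate: with so little spam in the mix, OCA is dominated by the ham count, so SHR and HSR are the more telling numbers when comparing the two sites.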


I'm using a shared group for both sites and my home dspam.conf looks like:

Home /var/dspam
DeliveryHost        127.0.0.1
DeliveryPort        10026
DeliveryIdent       dspam-out
DeliveryProto       SMTP
FallbackDomains on
OnFail error
Trust root
Trust nobody
Debug *
DebugOpt process spam fp innocent
TrainingMode toe
TestConditionalTraining on
Feature tb=3
Feature whitelist
Feature noise
Algorithm graham burton
PValue graham
SupressWebStats on
ImprobabilityDrive on
Preference "signatureLocation=headers"  # 'message' or 'headers'
AllowOverride trainingMode
AllowOverride spamAction spamSubject
AllowOverride statisticalSedation
AllowOverride enableBNR
AllowOverride enableWhitelist
AllowOverride showFactors
AllowOverride optIn optOut
AllowOverride whitelistThreshold
AllowOverride makeCorpus
AllowOverride fallbackDomain
AllowOverride trainingMode
MySQLServer     /var/lib/mysql/mysql.sock
MySQLUser               dspam
MySQLPass               dspam
MySQLDb                 dspamdb
MySQLConnectionCache    10
IgnoreHeader DomainKey-Signature
IgnoreHeader X-DKIM
IgnoreHeader X-Virus-Scanned
IgnoreHeader Delivered-To
IgnoreHeader In-Reply-To
IgnoreHeader X-OriginalArrivalTime
IgnoreHeader X-Disclaimer
IgnoreHeader X-Mailman-Approved-At
IgnoreHeader Archive
IgnoreHeader List-Post
IgnoreHeader List-Subscribe
IgnoreHeader List-Unsubscribe
IgnoreHeader List-Help
IgnoreHeader List-Id
IgnoreHeader Message-ID
Notifications   on
PurgeSignatures 21          # Stale signatures
PurgeNeutral    90          # Tokens with neutralish probabilities
PurgeUnused     90          # Unused tokens
PurgeHapaxes    30          # Tokens with less than 5 hits (hapaxes)
PurgeHits1S     15          # Tokens with only 1 spam hit
PurgeHits1I     15          # Tokens with only 1 innocent hit
LocalMX 127.0.0.1 192.168.0.3 213.75.3.22 213.10.163.78
SystemLog on
UserLog   on
Opt out
TrackSources spam
Broken lineStripping
MaxMessageSize 1024000
ServerHost              127.0.0.1
ServerPort              24
ServerQueueSize 32
ServerPID               /var/run/dspam.pid
ServerMode standard
ServerParameters       "--deliver=innocent,spam -d %u"
ServerIdent            "dspam-in"
ProcessorBias on
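
Several directives in a listing like this (`Feature`, `AllowOverride`, `IgnoreHeader`) appear many times, so an accidental repeat is easy to miss; `AllowOverride trainingMode` occurs twice above. Below is a minimal Python sketch for spotting that kind of thing. It is not part of dspam, the helper names `parse_directives` and `exact_duplicates` are made up for illustration, and the comment handling is naive (it assumes `#` never occurs inside a quoted value).

```python
from collections import defaultdict


def parse_directives(text):
    """Group 'Name value...' lines by directive name, dropping # comments."""
    directives = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive: cut at the first '#'
        if not line:
            continue
        parts = line.split(None, 1)  # split on any run of spaces or tabs
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[name].append(value)
    return directives


def exact_duplicates(directives):
    """Return (name, value) pairs that occur more than once."""
    dupes = []
    for name, values in directives.items():
        seen = set()
        for value in values:
            if value in seen:
                dupes.append((name, value))
            seen.add(value)
    return dupes


# A few lines copied from the listing above.
sample = """
TrainingMode toe
Feature tb=3
Feature whitelist
AllowOverride trainingMode
AllowOverride spamAction spamSubject
AllowOverride trainingMode
PurgeSignatures 21          # Stale signatures
"""

parsed = parse_directives(sample)
print(exact_duplicates(parsed))  # [('AllowOverride', 'trainingMode')]
```

To run it against a real file, read the file into a string and pass that to `parse_directives` in place of `sample`.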

Best,

--Tonni

--
Tony Earnshaw
Email: tonni at hetnet dot nl
