Hello:
I'm in the process of converting my small FreeBSD server from
sendmail/crm114/mbox to exim/dspam/maildir. This is proving to be
a bigger job than I originally planned (aren't they all?). I
apologize for the long email, but I need to explain the situation.
I first installed all the necessary programs on my desktop PC
(FreeBSD/i386) so that I could get my test configuration working
before I took everything "live".
I installed dspam from the FreeBSD ports tree. I'm using a MySQL 5.0
backend with TEFT training, not in daemon mode, with exim as the LDA
and procmail putting my spam into a mailbox based on headers. I
decided to train dspam with my last 1000 ham and 1000 spam messages,
so I filtered out my CRM114 headers with grep, converted each mbox to
maildir, and
fed the resulting directories to dspam_train. I'm still not sure
whether I want "pretraining" or not, but it at least confirmed dspam was
working. The results were as follows:
TP True Positives: 977
TN True Negatives: 998
FP False Positives: 2
FN False Negatives: 23
SC Spam Corpusfed: 0
NC Nonspam Corpusfed: 0
TL Training Left: 1500
SHR Spam Hit Rate 97.70%
HSR Ham Strike Rate: 0.20%
OCA Overall Accuracy: 98.75%
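For anyone wanting to reproduce the preparation step, here is a rough
Python sketch of stripping the X-CRM114 headers and converting an mbox
to a maildir. It uses Python's standard mailbox module in place of the
grep step described above, and the function name strip_and_convert and
the paths in the usage note are made up for illustration; it is not
the exact procedure I ran.

```python
import mailbox


def strip_and_convert(mbox_path, maildir_path, prefix="x-crm114"):
    """Copy every message from an mbox into a maildir, dropping any
    header whose name starts with `prefix` (case-insensitive).

    Hypothetical helper for illustration; returns the number of
    messages copied.
    """
    src = mailbox.mbox(mbox_path)
    dst = mailbox.Maildir(maildir_path, create=True)
    count = 0
    try:
        for msg in src:
            # Snapshot the header names first, since we delete while looping.
            for name in list(msg.keys()):
                if name.lower().startswith(prefix):
                    del msg[name]  # removes every occurrence of this header
            dst.add(msg)  # new messages land in the maildir's new/ folder
            count += 1
    finally:
        src.close()
        dst.close()
    return count
```

Called as, say, strip_and_convert("spam.mbox", "spam-maildir"), this
should leave a directory that can be handed to dspam_train the same
way as before.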
Not bad, I thought. Happy with that, I installed everything the same
way (from the ports tree with the same options) on my server, which
incidentally is FreeBSD/sparc64. The results of the identical training
run are as follows:
TP True Positives: 913
TN True Negatives: 1000
FP False Positives: 0
FN False Negatives: 87
SC Spam Corpusfed: 1
NC Nonspam Corpusfed: 0
TL Training Left: 1500
SHR Spam Hit Rate 91.30%
HSR Ham Strike Rate: 0.00%
OCA Overall Accuracy: 95.65%
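As a sanity check on the two tables, the percentages can be recomputed
from the raw counts. This is a minimal sketch that assumes
SHR = TP/(TP+FN), HSR = FP/(FP+TN) and OCA = (TP+TN)/total; those
formulas are my reading of the output (they match the figures above),
not something taken from the dspam_train source, and the function name
rates is just for illustration.

```python
def rates(tp, tn, fp, fn):
    """Return (SHR, HSR, OCA) as percentages from the four raw counts.

    Assumed definitions, chosen because they match the reported figures:
      SHR = TP / (TP + FN)      spam hit rate
      HSR = FP / (FP + TN)      ham strike rate
      OCA = (TP + TN) / total   overall accuracy
    """
    shr = 100.0 * tp / (tp + fn)
    hsr = 100.0 * fp / (fp + tn)
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return shr, hsr, oca


# Counts copied from the two training runs above: (TP, TN, FP, FN).
runs = {
    "i386": (977, 998, 2, 23),
    "sparc64": (913, 1000, 0, 87),
}

for name, counts in runs.items():
    shr, hsr, oca = rates(*counts)
    print(f"{name}: SHR {shr:.2f}%  HSR {hsr:.2f}%  OCA {oca:.2f}%")
```

So the whole gap between the two machines is in missed spam: 64 more
false negatives on sparc64 (87 against 23), offset slightly by 2 fewer
false positives.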
Why is dspam so much better on my Athlon than it is on my
UltraSPARC? The versions of FreeBSD are identical, with the same
version of dspam, the same build variables, and the same training
corpus (and the logs indicate the messages were processed in the same
order). I've run the training several times now (starting with an
empty MySQL db and the same X-CRM114 header stripped mbox files) and
the results from each machine are reproducible.
Any ideas? I don't want to put the less accurate dspam into
production (especially if I've found a bug). I can send the log
files if that would help; just let me know what other info is relevant.
Thanks,
-Peter