On 02.09.2011 08:14, Ladar Levison wrote:
On 8/28/2011 6:12 AM, Stevan Bajic wrote:
Should be fixed in GIT
Further testing found another code path that leads to a memory leak. I
didn't realize these lines were inside a for loop the first time
around; and without the loop a free call wouldn't be needed.
Fixed in GIT. Thanks for the patch.
Most people use the Apache SpamAssassin corpora for testing or the TREC
corpora.
I was wondering if there was an actively maintained corpus available.
Does it matter if the corpus is actively maintained or not? For testing
email classification the only thing that counts is the quality of the corpus.
Something like compressionratings.com -- but for email classification.
Something I could test my configuration/build against to see where it
ranks.
You can use dspam_train for that. It will spit out the score after it
has finished with training. Those numbers can be used to compare against
others that have used the same corpus for training.
You could use dspam_train and use dspam_stats to set/reset the snapshot.
Yeah! But I'll need a script or tool to automate the testing process?
I use the library API so I'm not a good person to write such a test
script.
dspam_train does this (see man dspam_train):
NAME
dspam_train - train a corpus of mail
SYNOPSIS
dspam_train [username] [--client] [-i index|spam_corpus
nonspam_corpus]
DESCRIPTION
dspam_train is used to train and test a corpus of mail (in
maildir or MBOX format). This tool will present each message to DSPAM
for a classification and then retrain only if the
message was incorrect. This provides close to real-world
training and should be used to build pretrained databases. Upon
execution, the tool will automatically determine the ratio
of spam:nonspam and train based on that ratio to ensure both
corpora are trained consecutively. This tool can also be used as a test
jig to measure the efficiency and accuracy of
a particular corpus against DSPAM in a given configuration.
OPTIONS
--client
If specified, DSPAM is used in client-server mode.
username
Specifies the user to train, if omitted the current user
name is used.
-i index
Use an index file instead of the usual spam_corpus and
nonspam_corpus.
index : Path to the index file having the following
format per line:
[class] [path to message]
spam_corpus
Specifies either the pathname to the directory containing
the corpus of spam, with each message in a separate file (e.g. maildir format)
or a path to the mailbox in the traditional Unix MBOX format.
nonspam_corpus
Specifies either the pathname to the directory containing
the corpus of nonspam with each message in a separate file or a path to
the mailbox in the traditional Unix
MBOX format.
EXIT VALUE
0 Operation was successful.
other Operation resulted in an error.
COPYRIGHT
Copyright (C) 2002-2011 DSPAM Project
All rights reserved.
For more information, see http://dspam.sourceforge.net.
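The index format from the man page above ([class] [path to message], one entry per line) could be generated from two directories of messages with a small helper. This is only a sketch: the function name build_index and the class labels "spam" and "ham" are assumptions, not taken from the dspam_train documentation, so check which labels your dspam_train expects before relying on it.

```python
import os


def build_index(spam_dir, nonspam_dir, index_path,
                spam_label="spam", nonspam_label="ham"):
    """Write one '[class] [path to message]' line per message file.

    The label strings are assumptions; adjust them to whatever
    dspam_train accepts in your installation.
    """
    entries = []
    for label, directory in ((spam_label, spam_dir),
                             (nonspam_label, nonspam_dir)):
        # Sort so the index is identical from run to run.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                entries.append(f"{label} {os.path.abspath(path)}")
    with open(index_path, "w") as handle:
        handle.write("\n".join(entries) + "\n")
    return entries
```

The resulting file would then be passed to dspam_train with the -i option shown in the synopsis. Sorting the file names keeps the training order stable, which matters if results from different builds are going to be compared.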
I don't understand what you mean by this. Are you trying to get a
certain score/result that you can compare with the other DSPAM
users/developers?
Exactly! If I run my build against an identical corpus I should get
identical results!
Well... yes and no. You can change a lot of stuff inside dspam.conf that
influences the results.
If the results vary, I know it's time to look for bugs. The goal is to
catch a string function or memory allocator that behaves differently.
I can always decide the deviation is small enough to ignore... but
only if I have results for comparison.
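The comparison described above, flagging a build whose results deviate from a reference run by more than an acceptable amount, might look roughly like the sketch below. The metric names (false_positives, false_negatives) and the function compare_results are hypothetical; nothing here reflects the actual output format of dspam_train.

```python
def compare_results(baseline, candidate, tolerance=0.0):
    """Return the metrics whose values differ by more than `tolerance`.

    Both arguments are dicts mapping a metric name to a number.
    A metric missing from either side always counts as a deviation.
    """
    deviations = {}
    for metric in sorted(set(baseline) | set(candidate)):
        expected = baseline.get(metric)
        actual = candidate.get(metric)
        if (expected is None or actual is None
                or abs(expected - actual) > tolerance):
            deviations[metric] = (expected, actual)
    return deviations
```

An empty result means the build matched the reference within the chosen tolerance. With tolerance=0.0 any difference at all is reported, which is the strict "identical corpus, identical results" case.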
String functions or memory allocator errors? In what sense? Do you think
that your build would trigger such an error while it will not trigger on
another setup?
I don't know how others benchmark their setup (and if they even do
benchmark their setup)? I myself have developed over the years my own
testing and training method. I don't use stock DSPAM methods at all. I
guess other DSPAM users/admins have established their own test and
training procedures as well.
I was hoping to find the tools/scripts/notes to test my
build/implementation. It would be nice if those tools became part of
'make check'. If the testing bits take up too much space, just
distribute them as a separate tarball. The libxml2 project uses that
strategy. Each release tarball is paired with its own test tarball.
Check out ftp://xmlsoft.org/libxml2/ for the files.
As I have written in the past: It is hard to do the same with DSPAM
because of the storage backend that DSPAM uses. You can not just make
check during configure. It is not that easy.
This is difficult since the backend is configurable with ./configure but
it is most likely not initialized, and a 'make check' would require a
properly configured backend (with all the schema and access already set
up), which is not available on a fresh/new setup at compile time.
A good start might be to compile the command line utilities and/or
test programs using the file system storage driver.
Not everyone is compiling that storage driver. Someone might choose to
just use MySQL and nothing else.
If the check variants are compiled in response to 'make check' and
stored inside the test folder they shouldn't cause any problems. Then
it's just a matter of automating the test process. And if the results
are stored under the build tree they could be easily purged
with 'make clean'. ClamAV ships with a test corpus and 'make check'
will test the corpus against the command line tools. It checks whether
a reasonable amount of memory was needed; that the program finished
quickly and most importantly that it generates the expected
classification.
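A check of that kind (did the program finish quickly, and did it produce the expected classification) could be sketched as follows. The helper run_check is hypothetical, the command to run is left entirely to the caller since no particular dspam invocation is assumed here, and the memory check is left out to keep the example short.

```python
import subprocess
import time


def run_check(command, expected_output, max_seconds):
    """Run `command` and return (passed, message).

    The check passes only if the command exits with status 0, finishes
    within `max_seconds`, and prints exactly `expected_output`
    (ignoring leading and trailing whitespace).
    """
    started = time.monotonic()
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=max_seconds)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {max_seconds}s"
    elapsed = time.monotonic() - started
    if proc.returncode != 0:
        return False, f"exit status {proc.returncode}"
    actual = proc.stdout.strip()
    if actual != expected_output:
        return False, f"expected {expected_output!r}, got {actual!r}"
    return True, f"ok in {elapsed:.2f}s"
```

A 'make check' target would call something like this once per message in the test corpus and fail the build on the first (False, ...) result.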
None of those tools have artificial intelligence like DSPAM has. ClamAV
is just checking hashes. If DSPAM would work the same way then making a
check suite would be easy, but it is not. We could however go on and
code a suite that checks if the text "I am a test" is producing the
proper result when using the various tokenizers. But that's all.
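To illustrate what such a tokenizer check could look like, here is a minimal sketch. The two tokenizers below are toy stand-ins (single words, and single words plus adjacent word pairs) written only for this example; they are not libdspam's tokenizers, and the expected token lists are assumptions that a real suite would have to replace with the output libdspam actually produces.

```python
def word_tokens(text):
    """Toy tokenizer: split on whitespace."""
    return text.split()


def chained_tokens(text):
    """Toy tokenizer: single words followed by adjacent word pairs."""
    words = text.split()
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return words + pairs


# Fixed input and the tokens each toy tokenizer is expected to produce.
SAMPLE = "I am a test"
CASES = {
    "word": (word_tokens, ["I", "am", "a", "test"]),
    "chained": (chained_tokens,
                ["I", "am", "a", "test", "I+am", "am+a", "a+test"]),
}


def check_tokenizers():
    """Return the names of tokenizers whose output differs from CASES."""
    return [name for name, (tokenize, expected) in CASES.items()
            if tokenize(SAMPLE) != expected]
```

Because the input and the expected tokens are fixed, a check like this needs no storage backend at all, which sidesteps the database setup problem.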
This strategy could be used to test libdspam and could allow limited
testing of the command line utilities. IMO that's the most important
chunk of code.
When time allows, adding logic to test different storage
configurations should be possible. Just write the check script with
the assumption that a valid test database is available. If the dspam user
won't connect to the localhost using the password 'bajic' then 'make
check' simply fails.
'bajic'? LOL. I am just a coding monkey. That's all. I for sure don't
want to taint users database backends with my family name as password.
If you are so ultra giga horny for tests then I could build a database
for MySQL and one for PostgreSQL that has tokens and dump that data and
upload it to Sourceforge. Users could then use that data to do some
limited testing.
If you wanted to get a little more complicated, try executing the
RDBMS binary against a localized config file. Then initialize your
blank database schema and listen for connections via a file socket or
named pipe. Since the database files are stored inside the build tree,
they can be purged and recreated each time 'make check' is called.
Check out the MySQL tarball and run "./configure; make && make check"
for the details.
P.S. If anyone else decides to test DSPAM using Valgrind, the current
release (3.6.1) will complain about glibc str functions reading dirty
memory via aligned reads. The issue is fixed in the valgrind code
repository -- for those willing/able to compile a 3.7.0 snapshot.
_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel