On 02.09.2011 08:14, Ladar Levison wrote:
On 8/28/2011 6:12 AM, Stevan Bajic wrote:
Should be fixed in GIT
Further testing found another code path that leads to a memory leak. I didn't realize these lines were inside a for loop the first time around; without the loop, a free call wouldn't have been needed.

Fixed in GIT. Thanks for the patch.


Most people use the Apache SpamAssassin corpora for testing, or the TREC corpora.
I was wondering if there was an actively maintained corpus available.
Does it matter if the corpus is actively maintained or not? For testing email classification, the only thing that counts is the quality of the corpus.


Something like compressionratings.com -- but for email classification. Something I could test my configuration/build against to see where it ranks.

You can use dspam_train for that. It will print out the score after it has finished with training. Those numbers can be used to compare against others who have used the same corpus for training.


You could use dspam_train and use dspam_stats to set/reset the snapshot.

Yeah! But I'll need a script or tool to automate the testing process. I use the library API, so I'm not a good person to write such a test script.

dspam_train does this (see man dspam_train):
NAME
       dspam_train - train a corpus of mail


SYNOPSIS
       dspam_train [username] [--client] [-i index|spam_corpus nonspam_corpus]


DESCRIPTION
       dspam_train is used to train and test a corpus of mail (in maildir or MBOX format). This tool will present each message to DSPAM for a classification and then
       retrain only if the message was incorrect. This provides close to real-world training and should be used to build pretrained databases. Upon execution, the tool
       will automatically determine the ratio of spam:nonspam and train based on that ratio to ensure both corpora are trained consecutively. This tool can also be used
       as a test jig to measure the efficiency and accuracy of a particular corpus against DSPAM in a given configuration.


OPTIONS
       --client
              If specified, DSPAM is used in client-server mode.


       username
              Specifies the user to train; if omitted, the current user name is used.


       -i index
              Use an index file instead of the usual spam_corpus and nonspam_corpus.

              index : Path to the index file having the following format per line:
              [class] [path to message]


       spam_corpus
              Specifies either the pathname to the directory containing the corpus of spam, with each message in a separate file (e.g. maildir format), or a path to the
              mailbox in the traditional Unix MBOX format.


       nonspam_corpus
              Specifies either the pathname to the directory containing the corpus of nonspam, with each message in a separate file, or a path to the mailbox in the
              traditional Unix MBOX format.


EXIT VALUE
       0      Operation was successful.
       other  Operation resulted in an error.


COPYRIGHT
       Copyright (C) 2002-2011 DSPAM Project
       All rights reserved.

       For more information, see http://dspam.sourceforge.net.
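
Putting the man page together with the earlier suggestion, a comparison run against a shared corpus might look like this (the user name and corpus paths below are placeholders, not part of any published test set):

       # Train/test against maildir corpora; note the accuracy figures printed at the end
       dspam_train testuser /path/to/corpus/spam /path/to/corpus/nonspam

       # Or drive it from an index file with one "[class] [path to message]" entry per line, e.g.
       #   spam     /path/to/corpus/spam/msg0001
       #   nonspam  /path/to/corpus/nonspam/msg0001
       dspam_train testuser -i /path/to/corpus/index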




I don't understand what you mean by this. Are you trying to get a
certain score/result that you can compare with other DSPAM
users/developers?

Exactly! If I run my build against an identical corpus I should get identical results!
Well... yes and no. You can change a lot of stuff inside dspam.conf that influences the results.


If the results vary, I know it's time to look for bugs. The goal is to catch a string function or memory allocator that behaves differently. I can always decide the deviation is small enough to ignore... but only if I have results for comparison.

String functions or memory allocator errors? In what sense? Do you think that your build would trigger such an error while it would not be triggered on another setup?


I don't know how others benchmark their setup (or whether they even
benchmark their setup at all). Over the years I have developed my own
testing and training method; I don't use the stock DSPAM methods at all. I
guess other DSPAM users/admins have established their own test and
training procedures as well.

I was hoping to find the tools/scripts/notes to test my build/implementation. It would be nice if those tools became part of 'make check'. If the testing bits take up too much space, just distribute them as a separate tarball. The libxml2 project uses that strategy. Each release tarball is paired with its own test tarball. Check out ftp://xmlsoft.org/libxml2/ for the files.

As I have written in the past: it is hard to do the same with DSPAM because of the storage backends that DSPAM uses. You cannot just run 'make check' right after configure. It is not that easy.


This is difficult since the backend is configurable with ./configure but
it is most likely not initialized, and a 'make check' would require a
properly configured backend (with the schema and access already set up),
which is not available on a fresh/new setup at compile time.

A good start might be to compile the command line utilities and/or test programs using the file system storage driver.
Not everyone is compiling that storage driver. Someone might choose to just use MySQL and nothing else.
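
For reference, a sketch of what that kind of build could look like; the --with-storage-driver flag name follows DSPAM 3.x configure and should be double-checked against ./configure --help:

       # Build against the self-contained hash (file system) driver so the check
       # programs would not depend on an external database being set up.
       ./configure --with-storage-driver=hash_drv
       make && make check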


If the check variants are compiled in response to 'make check' and stored inside the test folder they shouldn't cause any problems. Then it's just a matter of automating the test process. And if the results are stored under the build tree they could be purged easily with 'make clean'. ClamAV ships with a test corpus, and 'make check' will test the corpus against the command line tools. It checks whether a reasonable amount of memory was needed, that the program finished quickly, and most importantly that it generates the expected classification.
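
As a rough sketch of that kind of check (everything here is illustrative: the dspam flags, the test user and the sample message path are assumptions, not existing project test assets):

       #!/bin/sh
       # Hypothetical 'make check' helper: classify one known-spam sample and fail
       # if it is misclassified or takes unreasonably long.
       MSG=test/corpus/sample-spam.eml                  # placeholder path
       START=$(date +%s)
       RESULT=$(dspam --user testuser --classify --deliver=summary < "$MSG")
       END=$(date +%s)
       echo "$RESULT" | grep -q 'result="Spam"' || { echo "misclassified"; exit 1; }
       [ $((END - START)) -le 5 ] || { echo "too slow"; exit 1; }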

None of those tools have artificial intelligence like DSPAM has. ClamAV is just checking hashes. If DSPAM worked the same way, making a check suite would be easy, but it does not. We could, however, go on and code a suite that checks whether a test message like "I am a test" produces the proper result when using the various tokenizers. But that's all.

This strategy could be used to test libdspam and could allow limited testing of the command line utilities. IMO that's the most important chunk of code.

When time allows, adding logic to test different storage configurations shouldn't be impossible. Just write the check script with the assumption that a valid test database is available. If the dspam user can't connect to localhost using the password 'bajic', then 'make check' simply fails.
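
A minimal sketch of that kind of pre-flight check (the credentials and database name simply mirror the example above and are not real project defaults):

       #!/bin/sh
       # Hypothetical guard: bail out of 'make check' if the assumed test database
       # is not reachable with the documented test credentials.
       mysql --user=dspam --password=bajic --host=localhost dspam_test \
             --execute='SELECT 1' > /dev/null 2>&1 \
         || { echo "test database unavailable; skipping checks"; exit 1; }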

'bajic'? LOL. I am just a coding monkey. That's all. I for sure don't want to taint users' database backends with my family name as a password. If you are so ultra giga horny for tests then I could build a database for MySQL and one for PostgreSQL that has tokens, dump that data and upload it to Sourceforge. Users could then use that data to do some limited testing.
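
For what it's worth, such dumps could be produced with the stock client tools (the database names below are placeholders):

       # MySQL token data for limited testing
       mysqldump --user=dspam --password dspam > dspam-testtokens-mysql.sql

       # PostgreSQL equivalent
       pg_dump --username=dspam dspam > dspam-testtokens-pgsql.sql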


If you wanted to get a little more complicated, try executing the RDBMS binary against a localized config file. Then initialize your blank database schema and listen for connections via a file socket or named pipe. Since the database files are stored inside the build tree, they can be purged and recreated each time 'make check' is called. Check out the MySQL tarball and run "./configure; make && make check" for the details.
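
A rough sketch of that approach for MySQL (the paths and the schema file are placeholders; a real script would also have to initialize the data directory first, e.g. with mysql_install_db):

       # Start a throwaway MySQL instance confined to the build tree
       mysqld --defaults-file=./test/my-test.cnf \
              --datadir=./test/data \
              --socket=./test/mysql.sock \
              --skip-networking &

       # Load the DSPAM schema over the private socket
       mysql --socket=./test/mysql.sock --user=root --execute='CREATE DATABASE dspam_test'
       mysql --socket=./test/mysql.sock --user=root dspam_test < path/to/mysql_objects.sql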

P.S. If anyone else decides to test DSPAM using Valgrind, the current release (3.6.1) will complain about glibc str functions reading dirty memory via aligned reads. The issue is fixed in the valgrind code repository -- for those willing/able to compile a 3.7.0 snapshot.
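
For anyone who wants to reproduce that, an invocation along these lines shows the reads in question (the dspam arguments and message path are just an example):

       valgrind --leak-check=full --track-origins=yes \
                dspam --user testuser --classify --deliver=summary < test/corpus/sample.eml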
