Re: [Dspam-devel] A memory leak and the reading of uninitialized bytes in 3.10.1... patch attached

Stevan Bajic Sat, 03 Sep 2011 08:03:03 -0700

On 02.09.2011 08:14, Ladar Levison wrote:

On 8/28/2011 6:12 AM, Stevan Bajic wrote:
Should be fixed in GIT
Further testing found another code path that leads to a memory leak. I didn't realize these lines were inside a for loop the first time around; without the loop, a free call wouldn't be needed.

Fixed in GIT. Thanks for the patch.

Most people use the Apache SpamAssassin corpora or the TREC corpora for testing.
I was wondering if there was an actively maintained corpus available.

Does it matter whether the corpus is actively maintained? For testing email classification, the only thing that counts is the quality of the corpus.

Something like compressionratings.com -- but for email classification. Something I could test my configuration/build against to see where it ranks.

You can use dspam_train for that. It will spit out the score after it has finished training. Those numbers can be compared against the numbers of others who have used the same corpus for training.

You could use dspam_train and use dspam_stats to set/reset the snapshot.
Yeah! But I'll need a script or tool to automate the testing process? I use the library API, so I'm not a good person to write such a test script.

dspam_train does this (see man dspam_train):
NAME
       dspam_train - train a corpus of mail


SYNOPSIS

       dspam_train [username] [--client] [-i index|spam_corpus nonspam_corpus]



DESCRIPTION

       dspam_train is used to train and test a corpus of mail (in maildir
       or MBOX format). This tool will present each message to DSPAM for a
       classification and then retrain only if the message was incorrect.
       This provides close to real-world training and should be used to
       build pretrained databases. Upon execution, the tool will
       automatically determine the ratio of spam:nonspam and train based
       on that ratio to ensure both corpora are trained consecutively.
       This tool can also be used as a test jig to measure the efficiency
       and accuracy of a particular corpus against DSPAM in a given
       configuration.


OPTIONS
       --client
              If specified, DSPAM is used in client-server mode.


       username
              Specifies the user to train, if omitted the current username
              is used.



       -i index
              Use a index file instead of the usual spam_corpus and
              nonspam_corpus.

              index : Path to the index file having the following format
              per line:

              [class] [path to message]


       spam_corpus
              Specifies either the pathname to the directory containing
              the corpus of spam, with each in a separate file (e.g.
              maildir format) or a path to the mailbox in the traditional
              Unix MBOX format.


       nonspam_corpus
              Specifies either the pathname to the directory containing
              the corpus of nonspam with each message in a separate file
              or a path to the mailbox in the traditional Unix MBOX
              format.


EXIT VALUE
       0      Operation was successful.
       other  Operation resulted in an error.


COPYRIGHT
       Copyright (C) 2002-2011 DSPAM Project
       All rights reserved.

       For more information, see http://dspam.sourceforge.net.
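
For readers who want to see the two ideas from the DESCRIPTION in code, here is a rough Python sketch of ratio-based interleaving and train-on-error. It is not dspam_train's implementation: ToyFilter, interleave and train_on_error are hypothetical names, the class labels "spam" and "nonspam" are assumed for illustration, and the word-count classifier only stands in for DSPAM.

```python
from collections import Counter
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (true class, message text); labels are illustrative


def interleave(spam: List[str], nonspam: List[str]) -> List[Message]:
    """Merge two corpora by their ratio so both run out at about the same time."""
    merged: List[Message] = []
    s = n = 0
    for _ in range(len(spam) + len(nonspam)):
        # Take from spam while it is not ahead of its share of the total.
        spam_is_due = s < len(spam) and s * len(nonspam) <= n * len(spam)
        if n >= len(nonspam) or spam_is_due:
            merged.append(("spam", spam[s]))
            s += 1
        else:
            merged.append(("nonspam", nonspam[n]))
            n += 1
    return merged


class ToyFilter:
    """Stand-in classifier based on word counts. This is NOT DSPAM."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "nonspam": Counter()}

    def classify(self, text: str) -> str:
        words = text.lower().split()
        spam_score = sum(self.counts["spam"][w] for w in words)
        ham_score = sum(self.counts["nonspam"][w] for w in words)
        return "spam" if spam_score > ham_score else "nonspam"

    def learn(self, truth: str, text: str) -> None:
        self.counts[truth].update(text.lower().split())


def train_on_error(
    messages: List[Message],
    classify: Callable[[str], str],
    learn: Callable[[str, str], None],
) -> Tuple[int, int]:
    """Classify every message; retrain only on the ones that came out wrong."""
    correct = retrained = 0
    for truth, text in messages:
        if classify(text) == truth:
            correct += 1
        else:
            learn(truth, text)
            retrained += 1
    return correct, retrained
```

The pair returned by train_on_error (correct, retrained) is the kind of figure that can be compared between two builds, provided both ran over the same corpus in the same order.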

I don't understand what you mean with this? Are you trying to get a
certain score/result that you can compare with the other DSPAM
users/developers?
Exactly! If I run my build against an identical corpus I should get identical results!

Well... yes and no. You can change a lot of stuff inside dspam.conf that influences the results.

If the results vary, I know it's time to look for bugs. The goal is to catch a string function or memory allocator that behaves differently. I can always decide the deviation is small enough to ignore... but only if I have results for comparison.

String function or memory allocator errors? In what sense? Do you think that your build would trigger such an error while another setup would not?
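
The comparison discussed above (identical corpus, identical results, with a deviation that may be small enough to ignore) could be sketched roughly like this in Python. The metric names and both function names are made up for illustration and are not output fields of any DSPAM tool. The fingerprint helper assumes that '#' starts a comment line in the configuration file; it is there because results are only comparable when the active settings match.

```python
import hashlib
from typing import Dict, List


def config_fingerprint(config_text: str) -> str:
    """Hash only the active lines, so comments and blank lines are ignored."""
    lines = [line.strip() for line in config_text.splitlines()]
    active = [line for line in lines if line and not line.startswith("#")]
    return hashlib.sha256("\n".join(active).encode("utf-8")).hexdigest()


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return one line per metric that differs by more than the tolerance."""
    problems: List[str] = []
    for name in sorted(set(baseline) | set(candidate)):
        if name not in baseline or name not in candidate:
            problems.append(f"{name}: present in only one run")
            continue
        delta = abs(baseline[name] - candidate[name])
        if delta > tolerance:
            problems.append(
                f"{name}: {baseline[name]} vs {candidate[name]} (delta {delta:g})"
            )
    return problems
```

An empty list means the two runs agree within the tolerance. With the default tolerance of 0.0 any difference at all is reported, which is the strict "identical results" case.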

I don't know how others benchmark their setup (or whether they even
benchmark it at all). Over the years I have developed my own
testing and training method. I don't use stock DSPAM methods at all. I
guess other DSPAM users/admins have established their own test and
training procedures as well.
I was hoping to find the tools/scripts/notes to test my build/implementation. It would be nice if those tools became part of 'make check'. If the testing bits take up too much space, just distribute them as a separate tarball. The libxml2 project uses that strategy. Each release tarball is paired with its own test tarball. Check out ftp://xmlsoft.org/libxml2/ for the files.

As I have written in the past: it is hard to do the same with DSPAM because of the storage backend that DSPAM uses. You cannot just run 'make check' at configure time. It is not that easy.

This is difficult since the backend is configurable with ./configure but
it is most likely not initialized, and a 'make check' would require a
properly configured backend (with all the schema and access
already set up), which is not available on a fresh/new setup during
compile time.

A good start might be to compile the command line utilities and/or test programs with the file system storage driver.

Not everyone compiles that storage driver. Someone might choose to just use MySQL and nothing else.

If the check variants are compiled in response to 'make check' and stored inside the test folder, they shouldn't cause any problems. Then it's just a matter of automating the test process. And if the results are stored under the build tree, they could easily be purged with 'make clean'. ClamAV ships with a test corpus, and 'make check' will test the corpus against the command line tools. It checks that a reasonable amount of memory was needed, that the program finished quickly and, most importantly, that it generates the expected classification.
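
A check of that kind (did the tool finish in time, and did it produce the expected classification) might look roughly like the Python sketch below. check_command is a hypothetical helper, and the stand-in "classifier" is a Python one-liner, not a DSPAM or ClamAV binary. Measuring memory is left out here because a portable way to do it depends on the platform.

```python
import subprocess
import sys
from typing import List, Optional


def check_command(
    argv: List[str], stdin_text: str, expected: str, max_seconds: float
) -> Optional[str]:
    """Run one command; return None on success or a short failure reason."""
    try:
        done = subprocess.run(
            argv,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=max_seconds,  # the "finished quickly" part of the check
        )
    except subprocess.TimeoutExpired:
        return f"timed out after {max_seconds}s"
    if done.returncode != 0:
        return f"exit status {done.returncode}"
    actual = done.stdout.strip()
    if actual != expected:
        return f"expected {expected!r}, got {actual!r}"
    return None


# Stand-in for a command line classifier (an assumption for the example).
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; print('spam' if 'pills' in sys.stdin.read() else 'innocent')",
]

if __name__ == "__main__":
    failure = check_command(STAND_IN, "buy pills now", "spam", max_seconds=30)
    sys.exit(0 if failure is None else 1)
```

Exiting non-zero on any failure is what lets 'make check' stop, since make treats a non-zero exit status as an error.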

None of those tools have artificial intelligence like DSPAM has. ClamAV is just checking hashes. If DSPAM worked the same way, then making a check suite would be easy, but it does not. We could however go on and code a suite that checks if the text "I am a test" is producing the proper result when using the various tokenizers. But that's all.
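
A tokenizer check on the fixed input "I am a test" needs no storage backend, so it could be sketched roughly as below. The two tokenizers are toy stand-ins written for this example (single words, and single words plus adjacent pairs joined with '+'). They are not DSPAM's tokenizers, and the expected token lists are simply what these toy functions produce.

```python
from typing import Callable, Dict, List


def word_tokens(text: str) -> List[str]:
    """Toy tokenizer: whitespace-separated words."""
    return text.split()


def pair_tokens(text: str) -> List[str]:
    """Toy tokenizer: words followed by every adjacent pair of words."""
    words = text.split()
    return words + [f"{a}+{b}" for a, b in zip(words, words[1:])]


TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "word": word_tokens,
    "pairs": pair_tokens,
}

# Expected output for the fixed input, one entry per toy tokenizer.
EXPECTED: Dict[str, List[str]] = {
    "word": ["I", "am", "a", "test"],
    "pairs": ["I", "am", "a", "test", "I+am", "am+a", "a+test"],
}


def failing_tokenizers(text: str) -> List[str]:
    """Return the names of the tokenizers whose output differs from EXPECTED."""
    return [
        name
        for name, tokenize in sorted(TOKENIZERS.items())
        if tokenize(text) != EXPECTED[name]
    ]
```

Because the result is deterministic, a check like this would pass or fail the same way on every build, which makes it suitable for 'make check' even on a fresh setup.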

This strategy could be used to test libdspam and could allow limited testing of the command line utilities. IMO that's the most important chunk of code.
When time allows, adding logic to test different storage configurations should be possible. Just write the check script with the assumption that a valid test database is available. If the dspam user can't connect to the localhost using the password 'bajic', then 'make check' simply fails.

'bajic'? LOL. I am just a coding monkey. That's all. I for sure don't want to taint users' database backends with my family name as password. If you are so ultra giga horny for tests, then I could build a database for MySQL and one for PostgreSQL that has tokens, dump that data and upload it to Sourceforge. Users could then use that data to do some limited testing.

If you wanted to get a little more complicated, try executing the RDBMS binary against a localized config file. Then initialize your blank database schema and listen for connections via a file socket or named pipe. Since the database files are stored inside the build tree, they can be purged and recreated each time 'make check' is called. Check out the MySQL tarball and run "./configure; make && make check" for the details.
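
The localized config idea could be sketched in Python roughly as follows, with MySQL as the example server. The directory name test-db, the file names and the three helper functions are assumptions made for this example. The sketch only writes the config and builds the command line. Initializing the data directory, loading the schema and starting the server are not shown, and 'mysqld' is assumed to be on the PATH.

```python
import shutil
from pathlib import Path
from typing import List


def write_local_config(build_dir: Path) -> Path:
    """Write a server config whose files all live under the build tree."""
    root = build_dir / "test-db"
    data = root / "data"
    data.mkdir(parents=True, exist_ok=True)
    config = root / "my.cnf"
    config.write_text(
        "[mysqld]\n"
        f"datadir={data}\n"
        f"socket={root / 'mysql.sock'}\n"   # file socket inside the build tree
        f"pid-file={root / 'mysqld.pid'}\n"
        "skip-networking\n"                  # no TCP listener, socket only
    )
    return config


def server_command(config: Path) -> List[str]:
    """Command line for a private server instance (binary name assumed)."""
    return ["mysqld", f"--defaults-file={config}"]


def purge(build_dir: Path) -> None:
    """Remove the private database files, as a 'make clean' step might."""
    shutil.rmtree(build_dir / "test-db", ignore_errors=True)
```

Keeping the socket and data directory under the build tree means a test run never touches a production database, and no fixed password has to be assumed on the user's own server.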
P.S. If anyone else decides to test DSPAM using Valgrind, the current release (3.6.1) will complain about glibc str functions reading dirty memory via aligned reads. The issue is fixed in the valgrind code repository -- for those willing/able to compile a 3.7.0 snapshot.
_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
