On Wed, Dec 05, 2007 at 12:43:04AM +0100, Steve wrote:
> Hello list
> 
> I need help from PostgreSQL and SQLite gurus. I have invested some time in 
> making the MySQL storage driver for DSPAM 3.8.0 more solid and more capable. 
> Currently all of the SQL-based storage drivers contain small but nasty errors: 
> they all fail when you set MaxMessageSize in DSPAM high enough and train with 
> a huge mail.
> 
> To illustrate the problem:
> 1) Increase your MaxMessageSize in DSPAM (make it 10MB, 15MB or more; see the 
> example dspam.conf line below)
> 2) Download 
> http://www.cs.virginia.edu/~cs101/hws/hw6/markov/textfiles/bible.txt
> 3) Let DSPAM process and tag the text:
>    dspam --user <your_user> --process --deliver=summary --stdout \
>      < /path/to/bible.txt
> 
> The <your_user> should not be in NOTRAIN mode!
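> 
> For reference, step 1 is just the MaxMessageSize directive in dspam.conf. The 
> value below is only an example (roughly 15MB, given in bytes):
> 
>    MaxMessageSize 15728640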
> 
> The result will probably be nothing: DSPAM will not output anything, but 
> sql.errors will probably contain a failed SQL query. And if you run DSPAM in 
> client/server mode it will most likely have crashed (well... sort of: the 
> daemon keeps running, but the connection to your storage engine is gone for 
> --client mode).
> 
> I have fixed that issue on my 3.8.0 installation with MySQL. When I train 
> with the patched DSPAM 3.8.0 I get this:
> mail / # dspam --user globaluser --process --deliver=summary --stdout \
>     < /tmp/bible.txt
> X-DSPAM-Result: globaluser; result="Innocent"; class="Innocent";
> probability=0.0000; confidence=1.00; signature=1,4755c2c538801608415579
> mail / #
> 
> Checking for signature data I get this:
> mail / # mysql --user=$(sed -n "3,1p" /etc/mail/dspam/mysql.data) \
>     --password=$(sed -n "4,1p" /etc/mail/dspam/mysql.data) \
>     --socket=$(sed -n "1,1p" /etc/mail/dspam/mysql.data) \
>     -e "select uid,signature,octet_length(data),length,created_on from
>         dspam_signature_data where signature='1,4755c2c538801608415579'" \
>     $(sed -n "5,1p" /etc/mail/dspam/mysql.data)
> +-----+--------------------------+--------------------+---------+------------+
> | uid | signature                | octet_length(data) | length  | created_on |
> +-----+--------------------------+--------------------+---------+------------+
> |   1 | 1,4755c2c538801608415579 |            8804556 | 8804556 | 2007-12-04 |
> +-----+--------------------------+--------------------+---------+------------+
> mail / #
> 
> 
> With MySQL most people will get a length of 32767, since the source uses a 
> signed long to fill in the length for MySQL. The data column will probably 
> hold 65535 bytes, which is the maximum for a BLOB type field in MySQL.
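> 
> To give an idea of the schema side of the fix, the dspam_signature_data table 
> roughly needs to be widened like this (a sketch only, not the exact patch; the 
> target column types here are my choice for illustration):
> 
>    ALTER TABLE dspam_signature_data
>      MODIFY COLUMN `length` INT UNSIGNED, -- room for multi-MB signatures
>      MODIFY COLUMN data LONGBLOB;         -- plain BLOB stops at 65535 bytes
> 
> On top of that the driver code must stop truncating the length it writes, or 
> the table change alone will not help.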
> 
> I have changed a lot of code in the DSPAM 3.8.0 source to get the above 
> working. My problem now is that while most of the changes are in the 
> mysql_drv.c source file, some are outside that file and would affect every 
> other storage engine. So for this change to be useful I need to update the 
> other storage drivers as well.
> 
> I know PostgreSQL and SQLite, but not as well as I know MySQL. So my question, 
> or request for help, is: is anyone here on the list willing to help me get the 
> other storage engines working properly? You don't need to be a C coder (that 
> would help, but it is not required). I just need someone I can ask about 
> PostgreSQL and/or SQLite when I have storage-specific questions.
> 
> Anyone willing to help?
> 
> 
> I know that no one is crazy enough to train an anti-spam filter on 766'111 
> words, especially not with anything more complex than unigram tokenization (I 
> used noise reduction with OSB, the Burton, Graham and naive algorithms, and 
> BCR). But knowing that this is possible with DSPAM makes me more confident 
> about using DSPAM.
> 
> 
> // SteveB

Hi Steve,

We have tried using PostgreSQL as the backend DB since version 7.4 and
ran into performance problems. In our attempts to address them we developed
a patch to use the native BIGINT type instead of the much slower NUMERIC
type. 8.2 was the first version that was just good enough to handle our
traffic, but with too little headroom for comfort. We are getting ready
to test 8.3 and expect to migrate from MySQL to PostgreSQL for the backend
database at that time.
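
Roughly, the schema side of that change boils down to storing the token as a
plain BIGINT instead of NUMERIC. A sketch only (the table and column names are
the stock dspam_token_data ones, not necessarily the exact patch):

    ALTER TABLE dspam_token_data
        ALTER COLUMN token TYPE BIGINT USING token::BIGINT;

NUMERIC can hold DSPAM's unsigned 64-bit tokens directly, while BIGINT requires
the driver to map them into the signed range, but comparisons and index lookups
on BIGINT are much cheaper.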

In our testing, we never hit the query-length or field-length problems that we
observed with the MySQL backend. I would be happy to help with making the
PostgreSQL driver more stable and performant. I would also be interested in
supporting Markov training using a SQL backend instead of the hash driver.
Please let me know if I can help.

Cheers,
Ken Marshall
