On Fri, 06 Aug 2010 17:54:56 -0500
Nate Custer <ncus...@hostgator.com> wrote:

> Stevan,
> 
Hello Nate,


> I had ignored master<->master replication in mysql as it looked like a 
> lot of work.
>
It is not that much work. The only issue I have with it is that it is not that 
reliable. MySQL Cluster is better suited but uses more memory than normal 
master<->master replication.


> I am currently using a mapping file on each of the servers 
> to split traffic for multiple domains to different spamboxes. If I can 
> get the sql db growth under control and make master<->master replication work, 
> that would be a serious improvement in my overall design.
>
Getting the database growth under control is easy:

* Use one of the intelligent tokenizers (OSB or SBPH). I would suggest OSB 
since you plan to put the data in a MySQL database. OSB will produce MORE 
tokens than WORD or CHAIN, but in the end you will still have less to process 
because the tokenizer is far more intelligent than WORD or CHAIN and learns 
much faster. Taken together, that leads after a while to far fewer tokens than 
with CHAIN or WORD.

* For TrainingMode I would suggest TOE. The reason for TOE is:
  - It does not train tokens when classifying. It only changes token hits 
during training.

* Use a merged group that you pre-train. But train intelligently if you can. I 
use my own training script that implements TONE (Train On error Or Near 
error) style training with an asymmetric thick threshold range and double-sided 
training. I completely avoid signature-based training because it takes ages 
until the signature table is written, and I don't need it since I have the 
original mail available and can push that mail to DSPAM for training when 
needed. Currently I have hundreds of domains and my whole DSPAM database is 
slightly below 400MB.

* If you can ensure that training will be done with the original mail (or a 
slightly modified original, i.e. added headers but no significant changes) 
then you could speed up DSPAM by setting TrainPristine to on. This avoids 
filling the signature database, which is a significant speedup. You could then 
stay on TOE while having the benefit of NOTRAIN but still keep the possibility 
of training.

* Set a low MaxMessageSize. Spammers usually don't send messages in the 
megabyte range. Spam mail is usually below 64KB. Image spam is slightly bigger 
but still not that big. Choosing the right size here will lower the need to 
consult the storage backend, increase your processing speed (no tokenization, 
etc.) and decrease your false positive rate.

* Are you going to block messages before-queue, using something like an 
RBL/RHBL? If not, then choose a good one and add it to DSPAM. The current code 
only allows one RBL, but even so, a good one will also lower the need to 
read/write data from/to your storage backend.
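
As a rough sketch, the points above could end up in dspam.conf roughly like 
this. The MaxMessageSize value and the RBL host name are placeholders chosen 
only for illustration, and I am assuming that the Lookup directive is the RBL 
hook meant here. Check every directive against the dspam.conf shipped with 
your DSPAM version before using it.

```
# Illustrative fragment only; values marked "assumed" are placeholders.

# OSB tokenizer instead of word/chain
Tokenizer       osb

# Train only on error
TrainingMode    toe

# Only if retraining is always done with the original mail
TrainPristine   on

# Assumed value: 128KB, given in bytes; tune it to your own traffic
MaxMessageSize  131072

# Assumed host name; replace it with the RBL you actually trust
Lookup          "rbl.example.org"
```

The comments sit on their own lines on purpose, since I would not rely on 
trailing comments being accepted after a value.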


> Are you using 
> NFS for the /var/dspam data as well?
> 
No. I use GlusterFS. NFS is too limiting for me. I need something that is fast 
as hell and that lets a single anti-spam node go down without taking the other 
nodes with it. With NFS I get neither that kind of scalability nor that kind of 
speed and flexibility. The caching options found in the more recent Linux 
kernels help when using NFS, but compared to GlusterFS it is neither as 
flexible nor as fast.


> I am currently doing TOE training for all users, but I am seeing some 
> updates when I stress test (I think that may be the before 25k messages 
> filtered barrier being hit).
>
Yeah. I know. I could improve that. In fact I have already made some changes to 
that part of the code in the current GIT version. I could push it a bit further 
and avoid updates in some corner cases when using TOE.


> Right now no training should be happening 
> as the users don't know about dspam and should not be training.
> 
Aha. Okay. So who is doing the training then?


> The my.cnf is not very interesting (but is attached below). The 
> dspam.conf is mostly full of the IgnoreHeader rules you provided 
> earlier. To make it a bit more readable I filtered those out.
> 
Okay. I am going to comment on your dspam.conf further down. For the MySQL 
configuration I need some time to give you feedback. I hope that is okay?


> As far as the OS, am running a stock CentOS 5.4 64 bit, with the stock 
> compiler options:
> 
> [r...@vps2 mysql]# gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
> --infodir=/usr/share/info --enable-shared --enable-threads=posix 
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit 
> --disable-libunwind-exceptions --enable-libgcj-multifile 
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada 
> --enable-java-awt=gtk --disable-dssi --enable-plugin 
> --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre 
> --with-cpu=generic --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
> 
If speed is an issue then compile DSPAM and MySQL with the Intel C/C++ 
compiler. On my setup MySQL is about 20% to 30% faster when compiled with the 
Intel C/C++ compiler. I would say that DSPAM is faster too, but I have never 
really measured it, so I cannot give you any hard figures.

I have not benchmarked GCC 4.5.0 against the Intel C/C++ compiler but I would 
say that there is not much of a difference (if any). My feeling is that 4.5.0 
is roughly on the same speed level as the Intel C/C++ compiler. That's the 
reason why I am currently using GCC 4.5.0 on my production system. Be prepared, 
however, to do some manual work if you switch to a recent GCC version. For 
example, MySQL had issues with GCC 4.5.0 and the source had to be patched 
before it would compile. If that is not an option for you then better go with 
the Intel C/C++ compiler.


btw: Try out the GIT version. I have made some speed improvements in the GIT 
version that should minimize the load on your MySQL server.


> Nate
>
-- 
Kind Regards from Switzerland,

Stevan Bajić

 
> --------------------------------------------------------------------------------------------------
> 
> dspam.conf:
> 
[...]


> WebStats on
> 
Your users are not using the Web-UI, right? Then remove that option; it saves 
you a write to the stats file.


[...]
> 
> MySQLConnectionCache    8
> 
I personally would increase that number.


[...]
> 
> HashRecMax              98317
> HashAutoExtend          on
> HashMaxExtents          0
> HashExtentSize          49157
> HashPctIncrease         10
> HashMaxSeek             10
> HashConnectionCache     10
> 
Comment out those hash driver settings. They only eat up memory (DSPAM needs to 
read them, process them and keep them in memory) and you are not using any of 
them in your MySQL setup.
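
Commented out, and using the values from your own file, that block would look 
like this:

```
#HashRecMax              98317
#HashAutoExtend          on
#HashMaxExtents          0
#HashExtentSize          49157
#HashPctIncrease         10
#HashMaxSeek             10
#HashConnectionCache     10
```

Keeping the lines as comments rather than deleting them makes it easy to bring 
them back should you ever test the hash driver.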

[...]


> [r...@vps2 mysql]# cat /etc/my.cnf
> [mysqld]
> tmpdir=/tmp
> open_files_limit=33628
> 
> old_passwords=1
> datadir=/var/lib/mysql
> skip-locking
> safe-show-database
> tmp_table_size = 256M
> max_heap_table_size = 256M
> query_cache_limit=32M
> query_cache_size=96MB ## 32MB for every 1GB of RAM
> query_cache_type=1
> max_user_connections=25
> max_connections=150
> collation_server=utf8_unicode_ci
> character_set_server=utf8
> 
> 
> delayed_insert_timeout=40
> 
> interactive_timeout=10
> wait_timeout=3600
> connect_timeout=20
> thread_cache_size=128
> key_buffer=64M ## 32MB for every 1GB of RAM
> join_buffer=1M
> max_connect_errors=20
> max_allowed_packet=16M
> table_cache=8196
> record_buffer=1M
> sort_buffer_size=2M ## 1MB for every 1GB of RAM
> read_buffer_size=2M ## 1MB for every 1GB of RAM
> read_rnd_buffer_size=2M  ## 1MB for every 1GB of RAM
> thread_concurrency=8 ## Number of CPUs x 2
> myisam_sort_buffer_size=32M
> server-id=1
> 
> pbxt-record-cache-size=1G
> pbxt-index-cache-size=1G
> 
> 
> [mysql.server]
> user=mysql
> 
> 
> [safe_mysqld]
> open_files_limit=33628
> err-log=/var/log/mysqld.log
> pid-file=/var/lib/mysql/mysql.pid
> 
> [mysqldump]
> quick
> max_allowed_packet=16M
> 
> [mysql]
> no-auto-rehash
> 

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
