On Fri, 06 Aug 2010 17:54:56 -0500 Nate Custer <ncus...@hostgator.com> wrote:
> Stevan,
>
Hello Nate,

> I had ignored master<->master replication in mysql as it looked like a
> lot of work.
>
It is not that much work. The only issue I have with it is that it is
not that reliable. MySQL Cluster is better suited, but it uses more
memory than normal master<->master replication.

> I am currently using a mapping file on each of the servers
> to split traffic for multiple domains to different spamboxes. If I can
> get the sql db growth under control and make master<->replication work,
> that would be a serious improvement in my overall design.
>
Getting the database growth under control is easy:

* Use one of the intelligent tokenizers (i.e. OSB or SBPH). I would
  suggest OSB, since you plan to put the data in a MySQL database. OSB
  will produce MORE tokens than WORD or CHAIN, but in the end you will
  still have less to process, since the tokenizer is far more
  intelligent than WORD or CHAIN and learns much faster. Taken together,
  that leads after a while to far fewer tokens than with CHAIN or WORD.
* For TrainingMode I would suggest TOE. The reason for TOE is:
  - It does not train tokens when classifying. It only changes token
    hits when doing training.
* Use a merged group that you pre-train. But train intelligently if you
  can. I use my own training script, which implements TONE (Train On
  error Or Near error) style training with an asymmetric thick threshold
  range and double-sided training. I completely avoid signature-based
  training, because it takes ages until the signature table is written,
  and I don't need it since I have the original mail available and can
  push that mail to DSPAM for training when needed. Currently I have
  hundreds of domains and my whole DSPAM database is slightly below
  400MB.
* If you can ensure that training will be done with the original mail
  (or a slightly modified original, i.e. added headers but no
  significant changes to the original mail), then you could speed up
  DSPAM by setting TrainPristine to on. This avoids filling the
  signature database, which is a significant speedup. You could then
  stay on TOE while having the benefit of NOTRAIN, but still keep the
  possibility of training.
* Set a low MaxMessageSize. Spammers usually don't send messages in the
  megabyte range. Spam mail is usually below 64KB. Image spam is
  slightly bigger, but still not that big. Choosing the right size here
  will lower the need to consult the storage backend, increase your
  processing speed (no tokenization, etc.) and decrease your false
  positive rate.
* Are you going to block messages before-queue, using something like an
  RBL/RHBL? If not, then choose a good one and add it to DSPAM. The
  current code only allows one RBL, but still. Using a good one will
  also lower the need to write/read data to/from your storage backend.

> Are you using
> NFS for the /var/dspam data as well?
>
No. I use GlusterFS. NFS is too limiting for me. I need something that
is fast as hell and that lets a single anti-spam node go down without
tearing down the other nodes. With NFS I don't have that kind of
scalability, nor that kind of speed and flexibility. The caching
options found in the more recent Linux kernels help when using NFS, but
compared to GlusterFS it is neither as flexible nor as fast.

> I am currently doing TOE traning for all users, but I am seeing some
> updates when I stress test (I think that may be the before 25k messages
> filtered barrier being hit).
>
Yeah, I know. I could improve that. In fact I have already made some
changes to that part of the code in the current GIT version. I could
push it slightly further and avoid updates in some corner cases when
using TOE.
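
As a rough sketch, the tuning suggestions from the list above could be
written in dspam.conf along these lines. The directive names are the
ones mentioned in this mail or shipped in the stock dspam.conf; the
MaxMessageSize value and the lookup zone are illustrative assumptions,
not tested recommendations, and the merged group is not shown because
it is not set in dspam.conf:

```conf
# Sketch only: values below are assumptions, adjust them to your traffic.

# Intelligent tokenizer, suggested for a MySQL backend
Tokenizer osb

# Change token data only when training, not when classifying
TrainingMode toe

# Only safe if retraining is always done with the original mail
TrainPristine on

# Size limit in bytes; 262144 (256KB) is a hypothetical cut-off
MaxMessageSize 262144

# Hypothetical RBL zone; the current code accepts only one
Lookup "rbl.example.org"
```

Larger messages are then delivered without being tokenized, which is
where the saving on the storage backend comes from.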
> Right now no training should be happening
> as the users don't know about dspam and should not be training.
>
Aha. Okay. So who is training then?

> The my.cnf is not very interesting (but is attached below). The
> dspam.conf is mostly full of the IgnoreHeader rules you provided
> earlier. To make it a bit more readable I filtered those out.
>
Okay. I am going to write some comments about your dspam.conf further
down. For the MySQL configuration I need some time to give you
feedback. I hope that is okay?

> As far as the OS, am running a stock CentOS 5.4 64 bit, with the stock
> compiler options:
>
> [r...@vps2 mysql]# gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --enable-shared --enable-threads=posix
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-libgcj-multifile
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
> --enable-java-awt=gtk --disable-dssi --enable-plugin
> --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
> --with-cpu=generic --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
>
If speed is an issue, then compile DSPAM and MySQL with the Intel C/C++
compiler. On my setup MySQL is about 20% to 30% faster when compiled
with the Intel C/C++ compiler. I would say that DSPAM is faster too,
but I have never really measured it, so I cannot give you any hard
figures. I have not benchmarked GCC 4.5.0 against the Intel C/C++
compiler, but I would say that there is not much of a difference (if
any). My feeling is that 4.5.0 is more or less on the same speed level
as the Intel C/C++ compiler. That is the reason why I am currently
using GCC 4.5.0 on my production system. Be prepared, however, to do
some manual work if you switch to a recent GCC version.
For example, MySQL had issues when compiled with GCC 4.5.0, and you had
to patch the source in order to compile it with GCC 4.5.0. If this is
not an option for you, then better go with the Intel C/C++ compiler.

btw: Try out the GIT version. I have made some speed improvements in
the GIT version that should minimize the load on your MySQL server.

> Nate
>
--
Kind Regards from Switzerland,

Stevan Bajić

> --------------------------------------------------------------------------------------------------
>
> dspam.conf:
> [...]
> WebStats on
>
Your users are not using the Web-UI, right? Remove that option, as it
saves you a write to the stats file.

[...]
>
> MySQLConnectionCache 8
>
I personally would increase that number.

[...]
>
> HashRecMax 98317
> HashAutoExtend on
> HashMaxExtents 0
> HashExtentSize 49157
> HashPctIncrease 10
> HashMaxSeek 10
> HashConnectionCache 10
>
Comment out those hash driver settings. They only eat up your memory
(DSPAM needs to read them, process them and keep them in memory) and
you are not using any of them for your MySQL setup.

[...]
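
Applied to the quoted lines, the comments above would give something
like the following. This is only a sketch: the new
MySQLConnectionCache value is an assumed example, since the mail says
to increase it but gives no number.

```conf
# Web-UI is not used, so skip the write to the stats file
#WebStats on

# Was 8; 16 is a hypothetical value, pick one that fits your load
MySQLConnectionCache 16

# Hash driver settings are not used with the MySQL driver
#HashRecMax 98317
#HashAutoExtend on
#HashMaxExtents 0
#HashExtentSize 49157
#HashPctIncrease 10
#HashMaxSeek 10
#HashConnectionCache 10
```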
> [r...@vps2 mysql]# cat /etc/my.cnf
> [mysqld]
> tmpdir=/tmp
> open_files_limit=33628
>
> old_passwords=1
> datadir=/var/lib/mysql
> skip-locking
> safe-show-database
> tmp_table_size = 256M
> max_heap_table_size = 256M
> query_cache_limit=32M
> query_cache_size=96MB ## 32MB for every 1GB of RAM
> query_cache_type=1
> max_user_connections=25
> max_connections=150
> collation_server=utf8_unicode_ci
> character_set_server=utf8
>
>
> delayed_insert_timeout=40
>
> interactive_timeout=10
> wait_timeout=3600
> connect_timeout=20
> thread_cache_size=128
> key_buffer=64M ## 32MB for every 1GB of RAM
> join_buffer=1M
> max_connect_errors=20
> max_allowed_packet=16M
> table_cache=8196
> record_buffer=1M
> sort_buffer_size=2M ## 1MB for every 1GB of RAM
> read_buffer_size=2M ## 1MB for every 1GB of RAM
> read_rnd_buffer_size=2M ## 1MB for every 1GB of RAM
> thread_concurrency=8 ## Number of CPUs x 2
> myisam_sort_buffer_size=32M
> server-id=1
>
> pbxt-record-cache-size=1G
> pbxt-index-cache-size=1G
>
>
> [mysql.server]
> user=mysql
>
>
> [safe_mysqld]
> open_files_limit=33628
> err-log=/var/log/mysqld.log
> pid-file=/var/lib/mysql/mysql.pid
>
> [mysqldump]
> quick
> max_allowed_packet=16M
>
> [mysql]
> no-auto-rehash
>

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user