On Fri, 18 Dec 2009 00:58:04 +0100 Frantisek Hanzlik <[email protected]> wrote:
> Stevan Bajić wrote:
> > On Thu, 17 Dec 2009 18:28:58 +0100
> > Frantisek Hanzlik<[email protected]> wrote:
> >
> >> I want upgrade several DSPAM installation, all of them use hash driver,
> >> to 3.9.0. Is there any suggestion? Is possible use old databases, or
> >> it is not recommended?
> >>
> > You can use old databases without issues.
> >
> >
> >> Maybe, because of different (better) charset decoding (important for
> >> me, as in Czech are used utf8, 8859-2, cp1250,.. codings) and html
> >> parsing in 3.9.0, there is better throw away old databases and create
> >> new, probably with corpus training utilizing?
> >>
> > Since you are using the Hash driver any training you would want to do
> > can only be on a per user basis since the Hash driver does not have
> > DSPAM-groups support.
>
> Hello Stevan,
>
Ahoi Frantisek,

> how I have understand this (Hash driver does not have DSPAM-groups support) ?
>
Semi-correct. Everything that involves reading more than one database/css
file does not work with the Hash driver.

> README says, that hash driver not support merged groups, but other are
> probably OK, yes?
>
I need to look deeper into the code, but as far as I remember anything that
involves reading more than one database/css file does not work.

> In my configurations I mainly use "shared,managed" or
> "shared" groups and it work fine.
>
Shared just uses ONE single css file for a bunch of users. That should work
with the Hash driver.

> Or isn't possible use dspam_train script for DSPAM pre-training?
>
Yes, yes. It is possible to use the dspam_train script to pretrain the Hash
driver.

> And, in dspam sources is scripts/train.pl script, for which purposes is it?
>
That is an older version of dspam_train that is far, far, far behind the
current dspam_train in terms of functionality and in terms of the DSPAM
functions it uses (for example it does not handle blocklist, blacklist, etc).
You can use that script if you want, or use dspam_train, or write your own
training script. I, for example, use my own script, which does TONE (Train on
Error or Near Error) with additional features such as an asymmetric
threshold/thickness for the spam/ham training, double-sided training (this is
essential for the Hyperspace classifier in CRM114, and I find the idea good,
so I implemented it in my training script as well), etc. Most of the ideas
about how to train the correct way came up after using CRM114/OSBF-Lua for
many years.

My script is also several times faster than the original dspam_train, since I
don't use signature-based training (so I don't need to purge signatures after
a long training run), plus other small things that I need because I use the
script to feed my DSPAM instance with fresh data captured on my SPAM honeypot.
I needed that additional functionality because all training is done
automatically, without my intervention, and I need the script to be rock solid
and to continue running even if some mails produce errors in DSPAM during
training.

Currently I have the following options:
----------------------------------------------------------------
theia spam-stuff # ./dspam_train_tone_v5 --help
ERROR: spam corpus must be path to maildir directory or MBOX file.
Usage: ./dspam_train_tone_v5
    [[username]|[--user username]]  User name to use for training
    [--client]                      To run in client mode
    [--random]                      Randomly process corpi
    [--refute]                      To unlearn errors from opposite class
    [--subject]                     To show subject from error/unlearn/TONE
    [--max-retrain max_retrain]     Maximum relearns per error/TONE
    [--spam-threshold threshold]    TONE Spam threshold
    [--ham-threshold threshold]     TONE Ham threshold
    [--overleap count]              Overleap certain count of messages
    [--stop-after count]            Stop after processed certain count of messages
    [[-i index]|[spam_dir] [nonspam_dir]]
theia spam-stuff #
----------------------------------------------------------------

> >
> > I would say that you should keep the old databases and run the clean
> > process (cssclean/csscompress) daily to purge old tokens from the database.
> > Sooner or later the old unused tokens will vanish from the database and you
> > will only have new tokens.
> >
> > As soon as you use 3.9.0 your users will benefit from the different (better)
> > charset decoding and html parsing. Purging/removing the database will not
> > affect that capability in any negative nor in any positive way.
> >
>
> Well, I understand. I wanted try pre-train dspam from prepared spam and ham
> corpus, as I expect slightly better accuracy in addition to start with
> 3.9.0-fine CSS, especially on lazy users, which not train dspam fairly.
>
Then you should definitely use TOE or TUM but NOT TEFT. I mean in production.
For training you can use whatever you think is best for you.

> Sorry for my terrible english.
>
No problem at all.

> Thanks, Franta Hanzlík
>
--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
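
Note on TONE: the message describes Train on Error or Near Error with separate spam and ham thresholds and a cap on relearns per message (the --spam-threshold, --ham-threshold and --max-retrain options). The script itself is not shown in the thread, so the following is only a minimal Python sketch of that idea under stated assumptions: the classifier returns a spam score between 0 and 1, and the names ToneConfig, needs_training, train_tone, classify and learn are hypothetical and are not part of DSPAM or of dspam_train_tone_v5.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToneConfig:
    # All values are illustrative assumptions, not defaults from any real script.
    spam_threshold: float = 0.90  # spam scoring below this counts as error or near error
    ham_threshold: float = 0.20   # ham scoring above this counts as error or near error
    max_retrain: int = 3          # upper bound on relearns for one message


def needs_training(score: float, is_spam: bool, cfg: ToneConfig) -> bool:
    """Return True if the message was misclassified or classified too weakly.

    `score` is assumed to be a spam probability in [0, 1]. The two thresholds
    need not be symmetric around 0.5, which is what makes the scheme asymmetric.
    """
    if is_spam:
        return score < cfg.spam_threshold
    return score > cfg.ham_threshold


def train_tone(
    message: str,
    is_spam: bool,
    classify: Callable[[str], float],
    learn: Callable[[str, bool], None],
    cfg: ToneConfig,
) -> int:
    """Relearn `message` until it clears its threshold or max_retrain is hit.

    `classify` and `learn` are caller-supplied hooks (hypothetical); in a real
    setup they would wrap whatever filter is being trained.
    Returns the number of training rounds that were needed.
    """
    rounds = 0
    while rounds < cfg.max_retrain and needs_training(classify(message), is_spam, cfg):
        learn(message, is_spam)
        rounds += 1
    return rounds
```

A message that is already classified correctly and confidently costs one classification and no training, which is the point of training only on errors and near errors instead of on every message.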
