Re: [Dspam-user] how about upgrading hash DB to 3.9.0 ?

Stevan Bajić Thu, 17 Dec 2009 17:19:33 -0800

On Fri, 18 Dec 2009 00:58:04 +0100
Frantisek Hanzlik <[email protected]> wrote:


> Stevan Bajić wrote:
> > On Thu, 17 Dec 2009 18:28:58 +0100
> > Frantisek Hanzlik<[email protected]>  wrote:
> >
> >> I want upgrade several DSPAM installation, all of them use hash driver,
> >> to 3.9.0. Is there any suggestion? Is possible use old databases, or
> >> it is not recommended?
> >>
> > You can use old databases without issues.
> >
> >
> >> Maybe, because of different (better) charset decoding (important for
> >> me, as in Czech are used utf8, 8859-2, cp1250,.. codings) and html
> >> parsing in 3.9.0, there is better throw away old databases and create
> >> new, probably with corpus training utilizing?
> >>
> > Since you are using the Hash driver any training you would want to do
> > can only be on a per user basis since the Hash driver does not have
> > DSPAM-groups support.
> 
> Hello Stevan,
> 
Ahoi Frantisek,

> how should I understand this (Hash driver does not have DSPAM-groups support)?
>
Semi correct. Everything that involves reading more than one database/css file
does not work with the Hash driver.


> README says, that hash driver not support merged groups, but other are
> probably OK, yes?
>
I need to look deeper into the code, but as far as I remember anything that
involves reading more than just one database/css file does not work.


> In my configurations I mainly use "shared,managed" or
> "shared" groups and it works fine.
>
Shared is just using ONE single css file for a bunch of users. That should work 
with the Hash driver.


> Or isn't it possible to use the dspam_train script for DSPAM pre-training?
> 
Yes, yes. It is possible to use the dspam_train script to pretrain the Hash 
driver.


> And, in dspam sources is scripts/train.pl script, for which purposes is it?
> 
That is an older version of dspam_train that is far, far behind the current
dspam_train in terms of functionality and in terms of the DSPAM functions it
uses (for example it does not handle blocklist, blacklist, etc). You can use
that script if you want, or use dspam_train, or write your own training
script.

I, for example, use my own script that does TONE (Train on Error or Near
Error) with additional features like asymmetric threshold/thickness for the
spam/ham training, double-sided training (this is essential for the
Hyperspace classifier in CRM114 and I find the idea good, so I implemented it
in my training script as well), etc. Most of the ideas about how to train the
correct way came up after using CRM114/OSBF-Lua for many years.

My script is also several times faster than the original dspam_train since I
don't use signature based training (so I don't need to purge signatures after
a long training run), and it has other small things that I need because I use
the script to feed fresh data captured on my spam honeypot to my DSPAM
instance. I needed that additional functionality because all training is done
automatically, without my intervention, and I need the script to be rock solid
and to continue running even if some mails produce errors in DSPAM during
training. Currently I have the following options:
----------------------------------------------------------------
theia spam-stuff # ./dspam_train_tone_v5 --help
ERROR: spam corpus must be path to maildir directory or MBOX file.

Usage: ./dspam_train_tone_v5
  [[username]|[--user username]] User name to use for training
  [--client]                     To run in client mode
  [--random]                     Randomly process corpi
  [--refute]                     To unlearn errors from opposite class
  [--subject]                    To show subject from error/unlearn/TONE
  [--max-retrain max_retrain]    Maximum relearns per error/TONE
  [--spam-threshold threshold]   TONE Spam threshold
  [--ham-threshold threshold]    TONE Ham threshold
  [--overleap count]             Overleap certain count of messages
  [--stop-after count]           Stop after processed certain count of messages
  [[-i index]|[spam_dir] [nonspam_dir]]

theia spam-stuff #
----------------------------------------------------------------
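To make the TONE idea above more concrete, here is a minimal sketch in Python. It is not my dspam_train_tone_v5 script and it does not call DSPAM at all: `tone_train`, `ToyFilter` and the default threshold values are hypothetical names and numbers chosen only for illustration. A message gets trained when it is misclassified (error) or when it is classified correctly but its score still lies inside the "thick" zone (near error). The two thresholds are independent, which is what I mean by asymmetric.

```python
from typing import Callable, Iterable, Tuple


def tone_train(
    corpus: Iterable[Tuple[str, str]],
    classify: Callable[[str], float],
    train: Callable[[str, str], None],
    spam_threshold: float = 0.9,   # assumed value, spam must score at least this
    ham_threshold: float = 0.2,    # assumed value, ham must score at most this
    max_retrain: int = 3,
) -> int:
    """Train on error or near error; return the number of training calls."""
    trained = 0
    for msg, label in corpus:
        for _ in range(max_retrain):
            p = classify(msg)  # spam probability in [0, 1]
            if label == "spam":
                confident = p >= spam_threshold
            else:
                confident = p <= ham_threshold
            if confident:
                break  # correct and outside the thick zone: nothing to do
            train(msg, label)  # error or near error: (re)train this message
            trained += 1
    return trained


class ToyFilter:
    """Stand-in for a real filter: counts words per class (illustration only)."""

    def __init__(self) -> None:
        self.spam = {}
        self.ham = {}

    def classify(self, msg: str) -> float:
        words = msg.split()
        s = sum(self.spam.get(w, 0) for w in words)
        h = sum(self.ham.get(w, 0) for w in words)
        return 0.5 if s + h == 0 else s / (s + h)

    def train(self, msg: str, label: str) -> None:
        table = self.spam if label == "spam" else self.ham
        for w in msg.split():
            table[w] = table.get(w, 0) + 1
```

In a real setup `classify` and `train` would wrap calls to DSPAM, and the loop would catch and log failures on single mails so that one broken message cannot stop a long unattended run.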


> >
> > I would say that you should keep the old databases and run the clean
> > process (cssclean/csscompress) daily to purge old tokens from the database.
> > Sooner or later the old unused tokens will vanish from the database and you
> > will only have new tokens.
> >
> > As soon as you use 3.9.0 your users will benefit from the different (better)
> > charset decoding and html parsing. Purging/removing the database will not
> > affect that capability in any negative nor in any positive way.
> >
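A daily clean run like the one quoted above could be driven by a small wrapper along these lines. This is only a sketch under assumptions: `maintenance_commands` and `run_maintenance` are hypothetical helper names, the location of the css files depends on your installation, and I assume each tool takes the css file as its only argument (check the usage output of your build before relying on that).

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def maintenance_commands(
    dspam_home: str, tools: Sequence[str] = ("cssclean",)
) -> List[List[str]]:
    """Build one command per tool for every *.css file below dspam_home."""
    commands = []
    for css in sorted(Path(dspam_home).rglob("*.css")):
        for tool in tools:
            # Assumption: "<tool> <file.css>" is the calling convention.
            commands.append([tool, str(css)])
    return commands


def run_maintenance(dspam_home: str, dry_run: bool = True) -> None:
    """Print the commands, and execute them unless dry_run is set."""
    for cmd in maintenance_commands(dspam_home):
        print(" ".join(cmd))
        if not dry_run:
            # check=False: one failing file must not abort the whole run
            subprocess.run(cmd, check=False)
```

Pass `tools=("cssclean", "csscompress")` to `maintenance_commands` if you want both steps. Calling `run_maintenance` from cron with `dry_run=False` gives the daily run.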
> 
> Well, I understand. I wanted try pre-train dspam from prepared spam and ham
> corpus, as I expect slightly better accuracy in addition to start with
> 3.9.0-fine CSS, especially on lazy users, which not train dspam fairly.
>
Then you should definitely use TOE or TUM, but NOT TEFT, in production.
For training you can use whatever you think is best for you.
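To illustrate why the mode matters for lazy users, here is a rough sketch of what each mode decides to learn from a message. It is a simplification and not DSPAM's code: `tokens_to_train` is a hypothetical function, and the `maturity_hits` default of 25 is my assumption about the TUM cutoff, so check the README of your version.

```python
from typing import Dict, List, Sequence


def tokens_to_train(
    mode: str,
    tokens: Sequence[str],
    hits: Dict[str, int],
    is_error: bool,
    maturity_hits: int = 25,  # assumed TUM cutoff, verify against your README
) -> List[str]:
    """Return the tokens whose counters would be updated for this message."""
    mode = mode.upper()
    if mode not in ("TEFT", "TOE", "TUM"):
        raise ValueError("unknown training mode: " + mode)
    if is_error or mode == "TEFT":
        # Errors are always learned; TEFT learns every message in full.
        return list(tokens)
    if mode == "TOE":
        # Train on error only: a correctly classified message changes nothing.
        return []
    # TUM: keep learning tokens that are not yet "mature".
    return [t for t in tokens if hits.get(t, 0) < maturity_hits]
```

With TEFT every message the filter sees is written back into the database, including the ones it got wrong and nobody corrected. That is why I would not use it for users who do not retrain their errors.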


> Sorry for my terrible english.
> 
No problem at all.


> Thanks, Franta Hanzlík
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
