Re: [Dspam-user] how about upgrading hash DB to 3.9.0 ?

Frantisek Hanzlik Thu, 17 Dec 2009 19:57:32 -0800

Stevan Bajić wrote:
> On Fri, 18 Dec 2009 00:58:04 +0100
> Frantisek Hanzlik wrote:
>
>> Stevan Bajić wrote:
>>> On Thu, 17 Dec 2009 18:28:58 +0100
>>> Frantisek Hanzlik wrote:
>>>
>>>> I want to upgrade several DSPAM installations, all of which use the hash
>>>> driver, to 3.9.0. Are there any recommendations? Is it possible to keep
>>>> using the old databases, or is that not recommended?
>>>>
>>> You can use old databases without issues.
>>>
>>>
>>>> Maybe, because of the different (better) charset decoding (important for
>>>> me, as Czech uses the utf8, 8859-2, cp1250, ... encodings) and HTML
>>>> parsing in 3.9.0, it would be better to throw away the old databases and
>>>> create new ones, probably by training from a corpus?
>>>>
>>> Since you are using the Hash driver any training you would want to do
>>> can only be on a per user basis since the Hash driver does not have
>>> DSPAM-groups support.
>>
>> Hello Stevan,
>>
> Ahoi Frantisek,
>
>> how should I understand this (Hash driver does not have DSPAM-groups support)?
>>
> Semi correct. Everything that involves reading more than one database/css
> does not work with the Hash driver.


Aha. So with the hash driver it is probably not possible to use merged and
classification groups, and maybe not inoculation groups either, but shared
should be fine.


>
>> The README says that the hash driver does not support merged groups, but the
>> others are probably OK, yes?
>>
> I need to look deeper into the code but as far as I remember anything that
> involves reading more than just one database/css file does not work.
>
>
>> In my configurations I mainly use "shared,managed" or
>> "shared" groups and they work fine.
>>
> Shared is just using ONE single css file for a bunch of users. That should 
> work with the Hash driver.
>
>
>> Or isn't it possible to use the dspam_train script for DSPAM pre-training?
>>
> Yes, yes. It is possible to use the dspam_train script to pretrain the Hash 
> driver.
>
>
>> And in the dspam sources there is a scripts/train.pl script; what is it for?
>>
> That is an older version of dspam_train that is far, far, far behind the
> current dspam_train in terms of functionality and in terms of used DSPAM
> functions (for example it does not handle blocklist, blacklist, etc). You can
> use that script if you want, or use dspam_train, or make your own training
> script. I for example use my own script that uses TONE (Train on Error or
> Near Error) with additional features like asymmetric threshold/thickness for
> the spam/ham training, double-sided training (this is essential for the
> Hyperspace classifier in CRM114 and I find that idea good so I implemented it
> in my training script as well), etc. Most of the ideas about how to train the
> correct way came up after using CRM114/OSBF-Lua for many years. My script is
> also faster by factors than the original dspam_train since I don't use
> signature based training (so I don't need to purge signatures after a long
> training run) and other small things that I need because I use the script to
> feed fresh data to my DSPAM instance that I have captured on my SPAM
> honeypot.
> I needed that additional functionality because all training is done
> automatically without my own intervention and I need the script to be rock
> solid and to continue running even if some mails produce errors in DSPAM
> while doing the training.
> Currently I have the following options:
> ----------------------------------------------------------------
> theia spam-stuff # ./dspam_train_tone_v5 --help
> ERROR: spam corpus must be path to maildir directory or MBOX file.
>
> Usage: ./dspam_train_tone_v5
>    [[username]|[--user username]] User name to use for training
>    [--client]                     To run in client mode
>    [--random]                     Randomly process corpi
>    [--refute]                     To unlearn errors from opposite class
>    [--subject]                    To show subject from error/unlearn/TONE
>    [--max-retrain max_retrain]    Maximum relearns per error/TONE
>    [--spam-threshold threshold]   TONE Spam threshold
>    [--ham-threshold threshold]    TONE Ham threshold
>    [--overleap count]             Overleap certain count of messages
>    [--stop-after count]           Stop after processed certain count of messages
>    [[-i index]|[spam_dir] [nonspam_dir]]
>
> theia spam-stuff #
> ----------------------------------------------------------------
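To make the TONE idea a little more concrete, here is a minimal sketch of the decision it implies. It is not Stevan's script, which is not shown in this thread. The `Verdict` type, the 0.0 to 1.0 confidence scale, and the `classify` and `learn` callbacks are all hypothetical stand-ins. Only the general idea (train on errors and on correct but low-confidence results, with separate spam and ham thresholds and a retrain limit) comes from the description and the option list above.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Result of one classification (hypothetical structure)."""
    is_spam: bool       # class chosen by the filter
    confidence: float   # assumed scale: 0.0 (unsure) to 1.0 (certain)


def needs_training(verdict, truth_is_spam, spam_threshold, ham_threshold):
    """Return True if TONE would train on this message."""
    if verdict.is_spam != truth_is_spam:
        # Plain error: always train.
        return True
    # Correct result, but possibly too close to the boundary ("near error").
    # The threshold is asymmetric: it depends on the true class.
    threshold = spam_threshold if truth_is_spam else ham_threshold
    return verdict.confidence < threshold


def tone_train(message, truth_is_spam, classify, learn,
               spam_threshold=0.8, ham_threshold=0.6, max_retrain=3):
    """Train on one message until it is classified correctly with enough
    margin, or until max_retrain learn calls were made.

    classify(message) -> Verdict and learn(message, is_spam) are supplied
    by the caller. Returns the number of learn calls.
    """
    rounds = 0
    while rounds < max_retrain:
        verdict = classify(message)
        if not needs_training(verdict, truth_is_spam,
                              spam_threshold, ham_threshold):
            break
        learn(message, truth_is_spam)
        rounds += 1
    return rounds
```

A message that is already classified correctly and confidently costs one classification and no database write, which is where the speed advantage over training on everything would come from.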

Eh, I must admit, I do not understand all of this fine theory very well.



>>> I would say that you should keep the old databases and run the clean
>>> process (cssclean/csscompress) daily to purge old tokens from the database.
>>> Sooner or later the old unused tokens will vanish from the database and you
>>> will only have new tokens.
>>>
>>> As soon as you use 3.9.0 your users will benefit from the different (better)
>>> charset decoding and html parsing. Purging/removing the database will not
>>> affect that capability in either a negative or a positive way.
>>>
>>
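The daily clean run suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions. It assumes that cssclean and csscompress each take the path of one .css file as their only argument, so check the tools' usage on your installation first. The data directory in the usage note is a placeholder, and `build_maintenance_commands` is a hypothetical helper name. The function only builds the command lines, so they can be inspected before anything is run.

```python
from pathlib import Path


def build_maintenance_commands(data_dir, tools=("cssclean",)):
    """Return one command (as an argv list) per tool and per .css file
    found anywhere below data_dir.

    Assumption: each tool is invoked as `<tool> <path-to-css-file>`.
    """
    commands = []
    for css_file in sorted(Path(data_dir).rglob("*.css")):
        for tool in tools:
            commands.append([tool, str(css_file)])
    return commands
```

From a daily cron job one might then loop over `build_maintenance_commands("/path/to/dspam/data")` and hand each entry to `subprocess.run`, logging failures instead of aborting so that one damaged file does not stop the whole run.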
>> Well, I understand. I wanted to try pre-training dspam from a prepared spam
>> and ham corpus, as I expect slightly better accuracy, in addition to starting
>> with a CSS file created by 3.9.0, especially for lazy users who do not train
>> dspam properly.
>>
> Then you should definitely use TOE or TUM but NOT TEFT. I mean in production.
> For training you can use whatever you think is best for you.

Yes, after some training I usually switch to TOE. The README suggests it too,
when database cleaning is being done.
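For readers who do not have the three training modes in their head, the difference can be sketched as below. This is a simplified model and not DSPAM code. `tokens_to_update` is a hypothetical helper, and the 25-hit maturity limit for TUM is an assumption recalled from the DSPAM README and should be verified there.

```python
def tokens_to_update(mode, token_hits, is_error_retrain, maturity_hits=25):
    """Return the tokens whose counters would be written for one message.

    mode             -- "teft", "toe" or "tum"
    token_hits       -- dict mapping each token in the message to the
                        number of hits it already has in the database
    is_error_retrain -- True when the user is correcting a misclassification
    maturity_hits    -- TUM maturity limit (assumed value, see lead-in)
    """
    if mode not in ("teft", "toe", "tum"):
        raise ValueError("unknown training mode: %r" % (mode,))
    if is_error_retrain or mode == "teft":
        # Corrections always train everything; TEFT trains every message.
        return sorted(token_hits)
    if mode == "toe":
        # Train on Error: nothing is written unless there was an error.
        return []
    # Train until Mature: only tokens that are still "young" are written.
    return sorted(t for t, hits in token_hits.items() if hits < maturity_hits)
```

In this model TEFT writes to the database on every delivered message, while TOE writes only when a user reports an error, which fits the advice above to avoid TEFT in production.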

>> Sorry for my terrible English.
>>
> No problem at all

Yes, I know it is not a problem for you, but for me it is. Well, so be it.
While we are on the subject: I do not know whether you have noticed it, but
the day before yesterday I sent a Czech WebUI translation via the bug tracker.
What remains untranslated (besides some shortcuts in nav_admin_user.html,
which are probably better left as they are) is the "Tweak -1" button in
nav_performance.html. Can you please briefly explain its function?

Thanks, Franta
