On 18.04.2012 23:40, Ben wrote:
> Thanks for the explanation Stevan.
>
> Now that it appears it is worth switching, my next question is how best
> to do the switch for already existing users trained against TEFT:
>
> If I just change the dspam setting, what happens?

Allow me to explain in more detail.
Assume a message has the following tokens (in clear text):

  Thanks for the explanation Stevan

And assume those tokens have been learned by DSPAM as NOT SPAM. Then the
tokens would look more or less like this in the storage backend (again:
the tokens are in clear text):

+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 | Thanks         |         0 |             1 | 2012-04-18 |
|   1 | for            |         0 |             1 | 2012-04-18 |
|   1 | the            |         0 |             1 | 2012-04-18 |
|   1 | explanation    |         0 |             1 | 2012-04-18 |
|   1 | Stevan         |         0 |             1 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

And the stats would look as follows:

+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |                1 |                  0 |                      0 |              0 |                  0 |               0 |                   1 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+

Now assume you stay on TEFT and you get a message with the body "Thanks
for the explanation Stevan". Assume you get that message 10 times. Then
the data in the storage backend tables would look as follows...

....
with TEFT:

Tokens:

+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 | Thanks         |         0 |            11 | 2012-04-18 |
|   1 | for            |         0 |            11 | 2012-04-18 |
|   1 | the            |         0 |            11 | 2012-04-18 |
|   1 | explanation    |         0 |            11 | 2012-04-18 |
|   1 | Stevan         |         0 |            11 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

Stats:

+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |               11 |                  0 |                      0 |              0 |                  0 |               0 |                  11 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+

....
with TOE:

Tokens:

+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 | Thanks         |         0 |             1 | 2012-04-18 |
|   1 | for            |         0 |             1 | 2012-04-18 |
|   1 | the            |         0 |             1 | 2012-04-18 |
|   1 | explanation    |         0 |             1 | 2012-04-18 |
|   1 | Stevan         |         0 |             1 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

Stats:

+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |                1 |                  0 |                      0 |              0 |                  0 |               0 |                  11 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+

Do you see the difference? Does the above example help you to answer
your own question?

> Does it start over
> with no training data?

I think you can answer that yourself after reading the above example.
(Hint: No! It does not start from the beginning.)

> Convert the old data?

What do you think after reading the above example? Does it convert old
data? What would it convert (if it does a conversion at all)?

> Do some hybrid system where
> new information is trained as TOE but it keeps the TEFT data too?

Can you define what you consider to be data?

> Or
> should I just wipe the user from dspam and start anew? Maybe trying to
> train with some recent spam.

NO! I don't think YOU need to do that. The reason I am saying this is
that you explicitly asked if you should switch to TOE even if most of
your users are in training mode.
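As an aside, the counters in the TEFT and TOE tables above can be
reproduced with a toy simulation. This is only a sketch of the example
in this mail, not DSPAM code: the names UserData, learn_innocent,
process_innocent and simulate are made up for illustration, and it
tracks only innocent_hits, innocent_learned and innocent_classified
(last_hit and all spam counters are left out).

```python
from dataclasses import dataclass, field

# Tokens of the example message (clear text, as in the tables above).
TOKENS = ["Thanks", "for", "the", "explanation", "Stevan"]


@dataclass
class UserData:
    """Hypothetical stand-in for one user's token and stats rows."""
    innocent_hits: dict = field(default_factory=dict)  # token -> innocent_hits
    innocent_learned: int = 0
    innocent_classified: int = 0


def learn_innocent(user, tokens):
    """Learn a message as innocent: bump every token and innocent_learned."""
    for token in tokens:
        user.innocent_hits[token] = user.innocent_hits.get(token, 0) + 1
    user.innocent_learned += 1


def process_innocent(user, tokens, mode):
    """Process one message that gets classified as innocent."""
    user.innocent_classified += 1
    if mode == "TEFT":
        # TEFT (as described in this mail): every processed message is learned.
        learn_innocent(user, tokens)
    # TOE: only the classification counter changes; tokens stay untouched.


def simulate(mode, repeats=10):
    """Start from the initial tables, then process the message `repeats` times."""
    user = UserData()
    learn_innocent(user, TOKENS)      # initial state: learned once as NOT SPAM
    user.innocent_classified = 1      # initial stats row shows 1
    for _ in range(repeats):
        process_innocent(user, TOKENS, mode)
    return user


if __name__ == "__main__":
    for mode in ("TEFT", "TOE"):
        u = simulate(mode)
        print(mode, u.innocent_hits["Thanks"], u.innocent_learned,
              u.innocent_classified)
    # prints:
    # TEFT 11 11 11
    # TOE 1 1 11
```

Under these assumptions TEFT ends with 11 in every token row and in
innocent_learned, while TOE leaves both at 1 and only moves
innocent_classified to 11, which is the difference the tables show.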
So I assume you have a very young/fresh DSPAM installation anyway.
Correct me if I am wrong. If you really want to start from the
beginning, then do it. Personally, I would suggest using a globally
merged group. My global merged group allows me to add new users to my
DSPAM installation and give them a 99.x% catch rate from day one. Every
month I train that merged group with new spam corpora and with ham data
too. I try to keep the training of that global merged group to a
minimum by running monthly classification tests against spam and ham
messages (using only merged group data), and if I see a very low catch
rate or a high FP/FN rate then I train more intensively.

> Thanks,
>
> Ben

--
Kind Regards from Switzerland,

Stevan Bajić

> On 4/18/2012 3:23 PM, Stevan Bajić wrote:
>> On 18.04.2012 22:37, Ben Luey wrote:
>>> I setup dspam a while ago with TEFT. Everything I've read on the list
>>> says to use TOE instead of TEFT. Once the training period is over
>>> (>2,500 messages I believe) does it matter?
>> Yes it does!
>>
>>
>>> Does TOE vs TEFT only affect
>>> the spam detection when in training mode?
>> No! It affects every processing.
>> If you have TEFT then every token in the storage backend will be
>> modified on every single processed message (except on whitelisted,
>> blocklisted, blacklisted and virus messages) and the statistics for the
>> user (TP/TN count) will be changed too.
>>
>> TOE will on the other hand only change the statistics for the user
>> (TP/TN count).
>>
>>
>>> Put another way, if none of my users are still in training mode, is it
>>> worth switching?
>> YES! Internally DSPAM is anyway working slightly differently while in
>> training mode. So switching now to TOE does not have any negative or
>> positive effect (if you are really still in training mode).
>>
>>> Or should I just change the default for new users?
>> You can change it already now.
>> I would suggest you to change it already
>> now so that you don't have to think about it in the future.
>>
>>
>>> Thanks,
>>>
>>> Ben
>>>
>
>

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user