On 18.04.2012 23:40, Ben wrote:
> Thanks for the explanation Stevan.
>
> Now that it appears it is worth switching, my next question is how best
> to do the switch for already existing users trained with TEFT:
>
> If I just change the dspam setting, what happens?
Allow me to explain in more detail.


Assume a message has the following tokens (in clear text):
Thanks
for
the
explanation
Stevan


And assume DSPAM has learned those tokens as NOT SPAM. The tokens would 
then look roughly like this in the storage backend (again: tokens are 
shown in clear text):
+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 |         Thanks |         0 |             1 | 2012-04-18 |
|   1 |            for |         0 |             1 | 2012-04-18 |
|   1 |            the |         0 |             1 | 2012-04-18 |
|   1 |    explanation |         0 |             1 | 2012-04-18 |
|   1 |         Stevan |         0 |             1 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

And the stats would look as follows:
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |                1 |                  0 |                      0 |              0 |                  0 |               0 |                   1 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+


Now assume you get a message with the body "Thanks for the explanation 
Stevan" and that you get it 10 times. Depending on the training mode, 
the data in the storage backend tables would then look as follows...


.... with TEFT:

Tokens:
+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 |         Thanks |         0 |            11 | 2012-04-18 |
|   1 |            for |         0 |            11 | 2012-04-18 |
|   1 |            the |         0 |            11 | 2012-04-18 |
|   1 |    explanation |         0 |            11 | 2012-04-18 |
|   1 |         Stevan |         0 |            11 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

Stats:
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |               11 |                  0 |                      0 |              0 |                  0 |               0 |                  11 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+


.... with TOE:

Tokens:
+-----+----------------+-----------+---------------+------------+
| uid | token          | spam_hits | innocent_hits | last_hit   |
+-----+----------------+-----------+---------------+------------+
|   1 |         Thanks |         0 |             1 | 2012-04-18 |
|   1 |            for |         0 |             1 | 2012-04-18 |
|   1 |            the |         0 |             1 | 2012-04-18 |
|   1 |    explanation |         0 |             1 | 2012-04-18 |
|   1 |         Stevan |         0 |             1 | 2012-04-18 |
+-----+----------------+-----------+---------------+------------+

Stats:
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
| uid | spam_learned | innocent_learned | spam_misclassified | innocent_misclassified | spam_corpusfed | innocent_corpusfed | spam_classified | innocent_classified |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+
|   1 |            0 |                1 |                  0 |                      0 |              0 |                  0 |               0 |                  11 |
+-----+--------------+------------------+--------------------+------------------------+----------------+--------------------+-----------------+---------------------+


Do you see the difference? Does the above example help you to answer 
your own question?
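
The same example can be written as a small Python sketch that reproduces 
the counters from the tables above. This is only a toy model of the 
example, not DSPAM's actual code: the names UserState, learn_innocent, 
process_innocent and simulate are made up for illustration, and only 
messages that are correctly classified as innocent are modelled 
(retraining on errors is left out).

```python
from dataclasses import dataclass, field


@dataclass
class UserState:
    # token -> [spam_hits, innocent_hits]; a stand-in for the token table
    tokens: dict = field(default_factory=dict)
    # two of the columns from the stats table
    innocent_learned: int = 0
    innocent_classified: int = 0


def learn_innocent(state, words):
    """Record one innocent training pass over the message tokens."""
    for word in words:
        state.tokens.setdefault(word, [0, 0])[1] += 1
    state.innocent_learned += 1


def process_innocent(state, words, mode):
    """Model one message that is (correctly) classified as innocent."""
    state.innocent_classified += 1
    if mode == "TEFT":
        # TEFT: every processed message also updates the token data
        learn_innocent(state, words)
    elif mode != "TOE":
        # TOE: only the classification counter changes (no error here)
        raise ValueError(f"unknown training mode: {mode}")


def simulate(mode, repeats=10):
    words = "Thanks for the explanation Stevan".split()
    state = UserState()
    # Starting point from the first tables: learned once, classified once
    learn_innocent(state, words)
    state.innocent_classified += 1
    for _ in range(repeats):
        process_innocent(state, words, mode)
    return state


if __name__ == "__main__":
    for mode in ("TEFT", "TOE"):
        s = simulate(mode)
        print(mode, s.tokens["Thanks"], s.innocent_learned, s.innocent_classified)
```

Running it for both modes shows the same pattern as the tables: with 
TEFT the token hits and innocent_learned grow with every message, with 
TOE only innocent_classified grows.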




>   Does it start over
> with no training data?
I think you can give that answer yourself after reading the above 
example. (hint: No! It does not start from the beginning).

>   Convert the old data?
What do you think after reading the above example? Does it convert old 
data? What does it convert (if it does a conversion)?


>   Do some hybrid system where
> new information is trained as TOE but it keeps the TEFT data too?
Can you define what you consider to be data?

>   Or
> should I just wipe the user from dspam and start anew? Maybe trying to
> train with some recent spam.
NO! I don't think YOU need to do that. I am saying this because you 
explicitly asked whether you should switch to TOE even if most of your 
users are in training mode. So I assume you have a very young/fresh 
DSPAM installation anyway. Correct me if I am wrong.


If you really want to start from the beginning then do it. I personally 
would suggest that you use a globally merged group. My global merged 
group allows me to add new users to my DSPAM installation and give them 
a 99.x% catch rate from day one. Every month I train that merged group 
with new spam corpora and with ham data too. I try to keep the training 
of that global merged group to a minimum: each month I run some 
classification tests against spam and ham messages (using only the 
merged group data), and if I see a very low catch rate or a high FP/FN 
rate then I train more intensively.
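
As a rough illustration of such a monthly check, here is a minimal 
Python sketch that computes the catch rate and the FP/FN rates from a 
list of test results. The function names and the thresholds (99% catch 
rate, 0.1% false positives) are assumptions picked for the example, not 
values from DSPAM; producing the (actual, predicted) pairs from a 
classification run is left out.

```python
def rates(results):
    """results: iterable of (actual, predicted) pairs, each "spam" or "ham".

    Returns (catch_rate, fp_rate, fn_rate) as fractions between 0 and 1.
    """
    tp = fn = fp = tn = 0
    for actual, predicted in results:
        if actual == "spam":
            if predicted == "spam":
                tp += 1  # spam caught
            else:
                fn += 1  # spam missed
        else:
            if predicted == "spam":
                fp += 1  # ham wrongly flagged as spam
            else:
                tn += 1  # ham delivered
    spam_total = tp + fn
    ham_total = fp + tn
    catch_rate = tp / spam_total if spam_total else 0.0
    fn_rate = fn / spam_total if spam_total else 0.0
    fp_rate = fp / ham_total if ham_total else 0.0
    return catch_rate, fp_rate, fn_rate


def needs_more_training(results, min_catch=0.99, max_fp=0.001):
    """Hypothetical thresholds; tune them to your own installation."""
    catch_rate, fp_rate, _ = rates(results)
    return catch_rate < min_catch or fp_rate > max_fp
```

For example, a test run with 98 of 100 spam messages caught and 1 of 
100 ham messages flagged gives a 98% catch rate and a 1% FP rate, so 
with the thresholds above the merged group would get more training.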


> Thanks,
>
> Ben

-- 
Kind Regards from Switzerland,

Stevan Bajić

> On 4/18/2012 3:23 PM, Stevan Bajić wrote:
>> On 18.04.2012 22:37, Ben Luey wrote:
>>> I setup dspam a while ago with TEFT. Everything I've read on the list
>>> says to use TOE instead of TEFT. Once the training period is over
>>> (>2,500 messages I believe) does it matter?
>> Yes it does!
>>
>>
>>>     Does TOE vs TEFT only affect
>>> the spam detection when in training mode?
>> No! It affects every processing.
>> If you have TEFT then every token in the storage backend will be
>> modified on every single processed message (except on whitelisted,
>> blocklisted, blacklisted and virus messages) and the statistics for the
>> user (TP/TN count) will be changed too.
>>
>> TOE will on the other hand only change the statistics for the user
>> (TP/TN count).
>>
>>
>>> Put another way, if none of my users are still in training mode, is it
>>> worth switching?
>> YES! Internally DSPAM is anyway working slightly differently while in
>> training mode. So switching now to TOE does not have any negative or
>> positive effect (if you are really still in training mode).
>>
>>>     Or should I just change the default for new users?
>> You can change it already now. I would suggest you to change it already
>> now so that you don't have to think about it in the future.
>>
>>
>>> Thanks,
>>>
>>> Ben
>>>
>
>



_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user