On 18.04.2012 22:38, Steve Fatula wrote:

    *From:* Bradley Giesbrecht <bradley.giesbre...@gmail.com>
    *To:* Steve Fatula <compconsult...@yahoo.com>
    *Cc:* Dspam List <dspam-user@lists.sourceforge.net>
    *Sent:* Wednesday, April 18, 2012 3:04 PM
    *Subject:* Re: [Dspam-user] Increase Spam Hit Rate

    I can't help you other than to point out that you may have missed
    the two replies prior to the one you responded to, both of which
    suggested switching from 'TrainingMode TEFT' to 'TrainingMode TOE'.

Ok, so, what I had tried to ask for and was hoping to get was some sort of explanation as to why this might be the case when so many use TEFT, yet, I should try TOE.

English is not my native language, and explaining something so technical in English instead of my native language is not always as easy as one might think. But here we go... I will try to explain what TEFT is and why TOE is better.

TEFT stands for 'train everything'. Most users will tell you that it stands for 'train every fucking time'.
TOE stands for 'train on error'.

Allow me to ask how you yourself learn. I mean you, Steve Fatula. How did you learn as a kid that 2 plus 2 equals 4? Probably you learned it once and it has stayed in your mind ever since. In the beginning you probably learned it symbolically: the 'picture' '2+2' is '4'. Later you learned that '+' means addition and how it works, you learned the numbers, and once you had learned that logic you were able to add almost any number to almost any other number. Right?

After that you did not need to learn any more how addition works. Right? Until the moment someone told you to sum '(-20) + 3'. You probably got it wrong at first and then learned how to do it right. Then someone asked you to sum '(-10)+(-30)'. Probably you got that wrong at first too, then learned how to add several negative numbers, and after that you were able to master that as well. Right?

All of the above is how TOE works. It learns, then it is content until it makes an error, and then it learns from its own errors.

TEFT, on the other hand, learns constantly, even from the answers it already got right. It learns and learns and learns.

One could now say that since TEFT learns constantly, it improves constantly. But this is not the case. Learning is good, but TEFT very easily over-learns. In the beginning everyone thought: more learning = more spam caught.

But today we know that this is not right. If your brain worked the same way, you would hardly be able to function. You would not even be able to simply read this message. Instead of just reading the letters and words, you would LEARN the letters and words all over again. Yes: learn them and read them, with the same effort as when you were a kid and first learned to read.

Another bad aspect of TEFT is that most users never or rarely train. Since TEFT learns constantly, it will constantly learn WRONG things if the users don't correct it. Allow me to explain:

1) 1+1=2
2) 1+2=3
3) 2*3=6
4) 7-2=4
5) 3*3=4

1, 2 and 3 are mathematically correct while 4 and 5 are wrong.

Now let's say that a user is running TEFT and that DSPAM judges all the messages (1 to 5) to be mathematically correct. In that case DSPAM would relearn 1, 2 and 3 (making that response stronger) and WRONGLY learn that '7-2=4' and that '3*3=4'.

Now let's say that a user is running TOE and that DSPAM judges all the messages (1 to 5) to be mathematically correct. In that case DSPAM would NOT LEARN ANYTHING. It would not wrongly learn anything.

This is a huge difference!
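
The five-statement example above can be sketched as a toy model. This is not DSPAM code, only an illustration of the idea; the names `MESSAGES`, `run` and `wrong_entries` are made up for this sketch, and the "filter" is assumed to answer "correct" for every statement, exactly as in the example.

```python
# Toy model only: NOT DSPAM's implementation, just the example above in code.
# Each entry: (statement, is it really mathematically correct?)
MESSAGES = [
    ("1+1=2", True),
    ("1+2=3", True),
    ("2*3=6", True),
    ("7-2=4", False),
    ("3*3=4", False),
]


def run(mode, user_corrections=()):
    """Return what the filter has stored after seeing all five statements.

    mode: "TEFT" stores every verdict, "TOE" stores only reported errors.
    user_corrections: statements the user reported as misclassified.
    """
    learned = {}  # statement -> label the filter stored for it
    for statement, _truth in MESSAGES:
        verdict = True  # assumption: the filter says "correct" for everything
        if mode == "TEFT":
            learned[statement] = verdict  # learn own verdict, right or wrong
        if statement in user_corrections:
            learned[statement] = not verdict  # retrain on the reported error
    return learned


def wrong_entries(learned):
    """List the stored statements whose label disagrees with the truth."""
    truth = dict(MESSAGES)
    return sorted(s for s, label in learned.items() if label != truth[s])


teft = run("TEFT")  # user never corrects anything
toe = run("TOE")    # user never corrects anything

print(len(teft), wrong_entries(teft))
print(len(toe), wrong_entries(toe))
```

With a user who never corrects, the TEFT run stores all five statements and two of them are wrong, while the TOE run stores nothing. Passing the two wrong statements as `user_corrections` leaves no wrong entries in either mode, but TEFT still stores five entries against two for TOE.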

In the real world most users are very lousy trainers. So even though they don't train, they are constantly making their token data less accurate if they use TEFT. Had they used TOE, they would not have learned wrong things. Their data would get less accurate too, since their TP/TN count would increase with each message, but the decrease in accuracy would be slower than it is with TEFT.

Do you understand this?

Ahh... and TEFT is constantly either producing new tokens or increasing the count (ham/spam) for existing tokens. TOE does not do that. So in the long run TOE produces less data and is still more accurate than TEFT. TEFT is a brute-force way of learning, while TOE is more intelligent.
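
The difference in stored data can be illustrated with another toy sketch. Again, this is not how DSPAM stores tokens internally; the whitespace tokenizer, the `STREAM` sample and the helper names are assumptions made for illustration, and the user is assumed to report every misclassification.

```python
from collections import Counter

# Toy stream: (message text, filter verdict, true class). Hypothetical data.
STREAM = [
    ("cheap pills online", "spam", "spam"),
    ("meeting notes attached", "ham", "ham"),
    ("cheap flights online", "ham", "spam"),  # misclassified, user reports it
    ("lunch meeting tomorrow", "ham", "ham"),
]


def train(db, text, label, delta=1):
    """Add delta to the (token, label) counter of every token in text."""
    for token in text.lower().split():
        db[(token, label)] += delta


def process(mode, stream):
    """Return the token counters after the whole stream has been seen."""
    db = Counter()
    for text, verdict, truth in stream:
        if mode == "TEFT":
            train(db, text, verdict)  # TEFT: every message is trained
        if verdict != truth:  # the user reports the error
            if mode == "TEFT":
                train(db, text, verdict, -1)  # undo the wrong training
            train(db, text, truth)  # both modes learn the correction
    return db


teft = process("TEFT", STREAM)
toe = process("TOE", STREAM)

# Unary plus on a Counter drops entries whose count is zero or negative.
print("TEFT:", len(+teft), "entries,", sum(teft.values()), "total hits")
print("TOE: ", len(+toe), "entries,", sum(toe.values()), "total hits")
```

On these four messages the TEFT run ends with 9 token entries and 12 hits, while the TOE run ends with 3 entries and 3 hits, all of them coming from the one message that was actually misclassified.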

Now you will ask me why everyone else doesn't use TOE, and why DSPAM does not set TOE as the default. The second one is easy to answer: TEFT was the default in the past (with CHAIN), and our release manager does not like us to change old defaults. As for the first question: most people just follow some how-to they find on the net without knowing what they are doing. And on small (very small) installs TEFT produces results very quickly, while TOE can take some time to kick in. But this is with CHAIN. Users of something like OSB don't suffer from this as much as users of CHAIN.



Otherwise, it sounded more like a guess. Perhaps, there is no way to know what mode should be used. If there was though, was hoping for some sort of reason or methodology. I had suggested / asked if perhaps, it was due to having a low rate of spam, percent wise. This was my attempt to put some reason into the change. I have not worked on the code, and, did not plan on reading the code to figure out exactly how it worked and the whys of it, had hoped someone else might have an explanation.

Wiping out all the training and stuff for a guess (IF it was a guess, not saying it was as I don't know) shouldn't be taken lightly (which is the suggestion). That's a lot of time and effort, though, it wasn't yielding the greatest results anyway.

It's not that much time. It should not take you more than an overnight automatic training run to produce a very good merged global group.


So, if there is no way to know, and the only solution is to simply reload DSPAM and try a dozen combinations, that's not a very good use of my time. I'd probably just eliminate DSPAM at that point and use another product that does not require so much time.

If there is a way to know or make some sense out of it, I'd love to hear it. That's all I am saying. I hope that makes more sense and is not unreasonable.


------------------------------------------------------------------------------
Better than sec? Nothing is better than sec when it comes to
monitoring Big Data applications. Try Boundary one-second
resolution app monitoring today. Free.
http://p.sf.net/sfu/Boundary-dev2dev


_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user


--
Kind Regards from Switzerland,

Stevan Bajić
