On 18.04.2012 22:38, Steve Fatula wrote:
*From:* Bradley Giesbrecht <bradley.giesbre...@gmail.com>
*To:* Steve Fatula <compconsult...@yahoo.com>
*Cc:* Dspam List <dspam-user@lists.sourceforge.net>
*Sent:* Wednesday, April 18, 2012 3:04 PM
*Subject:* Re: [Dspam-user] Increase Spam Hit Rate
I can't help you other than to point out that you may have missed
the two replies prior to the one you responded to, both of which
suggested switching from 'TrainingMode TEFT' to 'TrainingMode TOE'.
Ok, so, what I had tried to ask for and was hoping to get was some
sort of explanation as to why this might be the case when so many use
TEFT, yet, I should try TOE.
English is not my native language, and explaining something this
technical in English instead of my native language is not always as easy
as one might think. But here we go... I will try to explain what TEFT is
and why TOE is better.
TEFT stands for 'train everything'. Most users will tell you that it
stands for 'train every fucking time'.
TOE stands for 'train on error'.
Allow me to ask how you yourself learn. I mean you, Steve Fatula. How do
you learn? How did you learn as a kid that 2 plus 2 equals 4? Probably
you learned it once and it has been in your mind ever since. Probably in
the beginning you learned it symbolically: the 'picture' '2+2' is '4'.
Later you learned that '+' means addition, how '+' works, and what the
numbers are, and once you had learned that logic/mechanism you were able
to add almost any number to almost any other number. Right?
After that you no longer needed to learn how addition works. Right?
Until the moment someone told you to compute '(-20) + 3'. You probably
got it wrong at first and then learned how to do it right. Until someone
asked you to compute '(-10)+(-30)'. You probably got that wrong at first
too, then learned how to add multiple negative numbers, and after that
you were able to master that as well. Right?
All of the above describes how TOE works. It learns and is happy until
it makes an error, and then it learns from its own errors.
TEFT, on the other hand, learns constantly. Even the right answers.
It learns and learns and learns and learns.
One could now say that since TEFT learns constantly, it improves
constantly. But this is not the case. Learning is good, but TEFT
over-learns very easily. In the beginning everyone was
thinking: more learning = more spam caught
But today we know that this is not right. Imagine your brain worked the
same way: you would barely be able to function. You would not even be
able to just read this message. Instead of just reading the letters and
words, you would LEARN the letters and words. Yeah. Learn them and read
them. The same learning you did when you were a kid and first started
learning to read.
Another bad aspect of TEFT is that most users never or rarely train.
Since TEFT is constantly learning, it will constantly learn WRONG things
if the users don't correct it. Allow me to explain:
1) 1+1=2
2) 1+2=3
3) 2*3=6
4) 7-2=4
5) 3*3=4
1, 2 and 3 are mathematically correct while 4 and 5 are wrong.
Now let's say that a user is running TEFT and that DSPAM says all the
messages (1 to 5) are mathematically correct. In that case DSPAM would
relearn 1, 2 and 3 (making that response stronger) and WRONGLY learn
that '7-2=4' and that '3*3=4'.
Now let's say that a user is running TOE and that DSPAM says all the
messages (1 to 5) are mathematically correct. In that case DSPAM would
NOT LEARN ANYTHING. It would not wrongly learn anything.
This is a huge difference!
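
To make the five-statement example concrete, here is a small Python
sketch. It is only a toy model and not DSPAM code: the function names
are made up for illustration, and it assumes a filter that calls every
statement 'correct' and a user who never sends a correction.

```python
from collections import Counter

# The five statements from the example, paired with whether each is true.
MESSAGES = [
    ("1+1=2", True),
    ("1+2=3", True),
    ("2*3=6", True),
    ("7-2=4", False),
    ("3*3=4", False),
]


def learn(mode, messages):
    """Return what a toy filter stores when the user never corrects it.

    Hypothetical model: the filter calls every statement 'correct'.
    TEFT stores each statement under that verdict; TOE stores
    nothing, because no error is ever reported back to it.
    """
    stored = Counter()
    for text, _truth in messages:
        verdict = True  # the filter's answer, right or wrong
        if mode == "TEFT":
            stored[(text, verdict)] += 1
    return stored


def wrongly_learned(stored, messages):
    """List statements stored under a verdict that contradicts the truth."""
    truth = dict(messages)
    return sorted(text for (text, verdict) in stored if verdict != truth[text])


teft = learn("TEFT", MESSAGES)
toe = learn("TOE", MESSAGES)
print(len(teft), wrongly_learned(teft, MESSAGES))
print(len(toe), wrongly_learned(toe, MESSAGES))
```

In this model TEFT stores all five statements, two of them wrongly,
while TOE stores nothing at all.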
In the real world most users are very lousy trainers. So even if they
don't train, they are constantly making their token data less accurate
if they use TEFT. Had they used TOE, they would not have learned wrong
things. Their data would still get less accurate, since their TP/TN
count would increase with each message, but the decrease in accuracy
would be slower than it is with TEFT.
Do you understand this?
Ahh... and TEFT is constantly either producing new tokens or increasing
the (ham/spam) count of existing tokens. TOE does not do that. So in the
long run TOE produces less data and is still more accurate than TEFT.
TEFT is a brute-force way of learning while TOE is more intelligent.
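
The difference in data volume can be sketched the same way. Again this
is a toy model under stated assumptions, not DSPAM's real storage code:
the message stream, its tokens and the single reported error are all
invented for illustration.

```python
from collections import Counter


def token_updates(mode, stream):
    """Count token-counter updates for a stream of (tokens, was_error) pairs.

    TEFT updates the counters for every message; TOE only for
    messages that were reported as misclassified.
    """
    counts = Counter()
    updates = 0
    for tokens, was_error in stream:
        if mode == "TEFT" or was_error:
            for tok in tokens:
                counts[tok] += 1
                updates += 1
    return updates, len(counts)


# Hypothetical stream: four messages, one of them reported as an error.
STREAM = [
    (["cheap", "pills"], False),
    (["meeting", "monday"], False),
    (["cheap", "flights"], True),
    (["lunch", "monday"], False),
]

teft_updates, teft_tokens = token_updates("TEFT", STREAM)
toe_updates, toe_tokens = token_updates("TOE", STREAM)
print("TEFT:", teft_updates, "updates,", teft_tokens, "distinct tokens")
print("TOE: ", toe_updates, "updates,", toe_tokens, "distinct tokens")
```

Even on four messages TEFT touches every token of every message, while
TOE only touches the tokens of the one message it got wrong.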
Now you will ask me why everyone else doesn't use TOE and why DSPAM does
not set TOE as the default. Well, the second one is easy to answer: TEFT
was the default in the past (with CHAIN) and our release manager does not
like us to change old defaults. As for the first question: most people
just follow some how-to they find on the net, without even knowing what
they are doing. And on small (very small) installs TEFT produces results
very quickly, while TOE can take some time to kick in. But this is with
CHAIN. Users using something like OSB don't suffer from this as much as
users using CHAIN.
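
For reference, the two settings discussed here would look roughly like
this in dspam.conf. Take it as a sketch only: I am assuming a recent
DSPAM release where the tokenizer is selected with a 'Tokenizer'
directive, so check the comments in the dspam.conf shipped with your
version for the exact directive names and accepted values.

```
# Train only when a classification error is reported
TrainingMode toe

# Use OSB instead of CHAIN for tokenizing
Tokenizer osb
```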
Otherwise, it sounded more like a guess. Perhaps, there is no way to
know what mode should be used. If there was though, was hoping for
some sort of reason or methodology. I had suggested / asked if
perhaps, it was due to having a low rate of spam, percent wise. This
was my attempt to put some reason into the change. I have not worked
on the code, and, did not plan on reading the code to figure out
exactly how it worked and the whys of it, had hoped someone else might
have an explanation.
Wiping out all the training and stuff for a guess (IF it was a guess,
not saying it was as I don't know) shouldn't be taken lightly (which
is the suggestion). That's a lot of time and effort, though, it wasn't
yielding the greatest results anyway.
It's not that much time. It should not take you more than an overnight
automatic training run to produce a very good merged global group.
So, if there is no way to know, and the only solution is to simply
reload DSPAM and try a dozen combinations, that's not a very good use
of my time. I'd probably just eliminate DSPAM at that point and use
another product that does not require so much time.
If there is a way to know or make some sense out of it, I'd love to
hear it. That's all I am saying. I hope that makes more sense and is
not unreasonable.
------------------------------------------------------------------------------
Better than sec? Nothing is better than sec when it comes to
monitoring Big Data applications. Try Boundary one-second
resolution app monitoring today. Free.
http://p.sf.net/sfu/Boundary-dev2dev
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
--
Kind Regards from Switzerland,
Stevan Bajić