On Wednesday, February 23, 2005, 3:06:03 PM, Scott wrote:

SF> -Mad,
SF> Will there be an MDLP page explaining some of the columns?
SF> SQ = Spam Test Quality?
SF> SI = Spam Test Result Important Count?
SF> avgSD = Average Spam Test Dominance?

Yes. Once I get a few minutes to rub together I'll make fire and get that MDLP page populated :-)

In the meantime, here is this little bit...

SQ = SA^2 ... This expands the accuracy fraction and eliminates the sign. SQ is how the tests compete for high weights in the AI. Think of it as the square of the distance to the goal... that goal being a perfect score.

SI = Spam test importance count. Each time a message is scanned it is an event. With each event some number of tests will fire. The tests that fire together during a scan event each contribute to the total weight. Any single test which has a weight high enough to swing the total weight across the spam/ham threshold is "important".

In a highly accurate system we would like to see many tests appear "important" during a scan event, because this ensures that the tests involved are not "swamping" the result.

For example, suppose we scan a message and we get 5 tests answering:

TEST1 = 50
TEST2 = 35
TEST3 = 10
TEST4 = 64
TEST5 = 26
TOTAL = 185

None of the tests are "important" because no single test accounts for enough of the total weight to carry it across the threshold. Take away TEST4, for example, and the total only falls to 121, which is still above 100.

This might be perfectly fine... that is, the message may simply be so "spammy" that we would be happy if only half of the tests "agreed" about it... But that way of thinking isn't very sensitive to error, and since the system must evaluate the test accuracy based on the aggregate results of many tests, it is very important that the system be as sensitive as possible. This way it's not likely to "fool" itself into "believing" a handful of tests and then evaluating all others against them.
(We don't want any emperors sporting new outfits ;-)

Now suppose we use lesser weights on the bigger tests:

TEST1 = 35
TEST2 = 30
TEST3 = 10
TEST4 = 44
TEST5 = 23
TOTAL = 142

Now we can see that by scaling back the weights a bit we raise the sensitivity for TEST4. If you take away TEST4 in this case then the total drops to 98, which is below the threshold, so in this event TEST4 was "important".

SI is a count of the number of events in the data set where the test in question was "important".

This leads us to avgSD. Once we start down the road of looking for this sensitivity we find we want more of it. Consider the following:

TEST1 = 30
TEST2 = 25
TEST3 = 8
TEST4 = 40
TEST5 = 20
TOTAL = 123

Now we can see that there are 3 "important" tests. TEST1, TEST2, and TEST4 are all big enough to tip the scale. As a result, if any of these tests "don't agree" then the message would be passed as non-spam (in this event anyway).

Test dominance is a measure of how many other important tests are present when a given test is "important".

In the case with a total of 142, TEST4 was completely dominant, so its test dominance number for that event would be 1.0. Put in other words, the only thing that "mattered" in this case was TEST4. The other tests were present, but they would have had to "gang up" in order to effectively disagree with TEST4. This can be a dangerous thing, since it means that TEST4 can buy itself a high accuracy score very easily.

In the event with the total weight of 123, TEST1, TEST2, and TEST4 would each receive a test dominance score of 0.333; that is to say, the test is "one of three (1/3)" of the "important" tests during that event. This is much better. Every one of these tests now has to work hard in order to get a high accuracy score, and that's what we want.

avgSD is the average of the test dominance numbers that are given to a particular test.
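
For anyone who wants to check the arithmetic, here is a rough Python sketch of the "important" and test dominance bookkeeping, run over the three example events from this message. It is only an illustration, not the actual MDLP code: the function names are made up, the threshold of 100 is taken from the "still above 100" remark, treating a total at or above the threshold as spam is an assumption, and averaging dominance only over the events where a test was "important" is one reading of the avgSD description.

```python
from collections import defaultdict

THRESHOLD = 100  # assumed spam/ham threshold, from the "above 100" remark


def is_spam(total, threshold=THRESHOLD):
    # Assumption: a total at or above the threshold counts as spam.
    return total >= threshold


def important_tests(event, threshold=THRESHOLD):
    """Return the tests whose removal alone would flip the verdict."""
    total = sum(event.values())
    verdict = is_spam(total, threshold)
    return [name for name, weight in event.items()
            if is_spam(total - weight, threshold) != verdict]


def dominance(event, threshold=THRESHOLD):
    """Map each important test to 1 / (number of important tests)."""
    important = important_tests(event, threshold)
    return {name: 1.0 / len(important) for name in important}


def summarize(events, threshold=THRESHOLD):
    """Return {test: (SI, avgSD)} over a list of scan events.

    Assumption: avgSD averages only the events where the test was important.
    """
    scores = defaultdict(list)
    for event in events:
        for name, sd in dominance(event, threshold).items():
            scores[name].append(sd)
    return {name: (len(sds), sum(sds) / len(sds))
            for name, sds in scores.items()}


# The three example events, in the order they appear in this message.
events = [
    {"TEST1": 50, "TEST2": 35, "TEST3": 10, "TEST4": 64, "TEST5": 26},  # 185
    {"TEST1": 35, "TEST2": 30, "TEST3": 10, "TEST4": 44, "TEST5": 23},  # 142
    {"TEST1": 30, "TEST2": 25, "TEST3": 8, "TEST4": 40, "TEST5": 20},   # 123
]

for event in events:
    print(sum(event.values()), important_tests(event))

print(summarize(events))
```

Under those assumptions the first event has no important tests, the second has only TEST4, and the third has TEST1, TEST2, and TEST4. Across the three events TEST4 ends up with SI = 2 and an avgSD of about 0.667 (the average of 1.0 and 1/3), while TEST1 and TEST2 each get SI = 1 and an avgSD of about 0.333.
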
The AI is sensitive to avgSD, so the instinct of a test with a very high avgSD is to reduce its weight so that it plays well with the other tests.

There are many other "instincts" given to the AI creatures that live on (represent) these tests, but this is how these particular numbers get their meaning. One way you can think of it is that these numbers (and many of the others you see on the page) are how the creatures "feel/see" their world.

Hope this helps,
_M
