On Wednesday, February 23, 2005, 3:06:03 PM, Scott wrote:

SF> -Mad,

SF> Will there be an MDLP page explaining some of the columns?
SF> SQ= Spam Test Quality?
SF> SI = Spam Test Result Important Count?
SF> avgSD = Average Spam Test Dominance?

Yes. Once I get a few minutes to rub together I'll make fire and get
that MDLP page populated :-)

In the meantime here is this little bit...

SQ = SA^2 ... This expands the accuracy fraction and eliminates the
sign. SQ is how the tests compete for high weights in the AI. Think of
it as the square of the distance to the goal... that goal being a
perfect score.

SI = Spam test importance count. Each time a message is scanned it is
an event. With each event some number of tests will fire. The tests
that fire together during a scan event each contribute to the total
weight. Any single test which has a weight high enough to swing the
total weight across the spam/ham threshold is "important". In a highly
accurate system we would like to see many tests appear "important"
during a scan event because this ensures that the tests involved are
not "swamping" the result.

For example, suppose we scan a message and we get 5 tests answering:

TEST1 = 50
TEST2 = 35
TEST3 = 10
TEST4 = 64
TEST5 = 26

TOTAL = 185

None of the tests are "important" because no single test is enough of
the total weight for the weight to cross the threshold. Take away
TEST4, for example, and the total only falls to 121, which is still
above the threshold of 100. This might be perfectly fine... that is,
the message may
simply be so "spammy" that we would be happy if only half of the tests
"agreed" about it... But, that way of thinking isn't very sensitive to
error, and since the system must evaluate the test accuracy based on
the aggregate results of many tests it is very important that the
system be as sensitive as possible. This way it's not likely to "fool"
itself into "believing" a handful of tests and then evaluating all
others against them. (We don't want any emperors sporting new outfits
;-)

Now suppose we use lesser weights on the bigger tests ---

TEST1 = 35
TEST2 = 30
TEST3 = 10
TEST4 = 44
TEST5 = 23

TOTAL = 142

Now we can see that by scaling back the weights a bit we raise the
sensitivity for TEST4. If you take away TEST4 in this case then the
total drops to 98 which is below the threshold - so in this event TEST4
was "important".

SI is a count of the number of events in the data set where the test
in question was "important".
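
A minimal sketch of that counting, for anyone who wants to play with
the numbers. This is not the actual MDLP code. It assumes the threshold
is 100 (as in the examples above) and that a total at or above the
threshold counts as spam; the names important_tests and si_counts are
made up for illustration.

```python
from collections import Counter

THRESHOLD = 100  # assumed from the examples ("still above 100")


def important_tests(weights, threshold=THRESHOLD):
    """Names of tests whose removal alone flips the spam/ham verdict."""
    total = sum(weights.values())
    is_spam = total >= threshold  # assumption: >= threshold means spam
    return [name for name, w in weights.items()
            if (total - w >= threshold) != is_spam]


def si_counts(events, threshold=THRESHOLD):
    """Count, per test, the scan events in which it was "important"."""
    counts = Counter()
    for weights in events:
        counts.update(important_tests(weights, threshold))
    return counts


# The two scan events from the examples above.
event_185 = {"TEST1": 50, "TEST2": 35, "TEST3": 10, "TEST4": 64, "TEST5": 26}
event_142 = {"TEST1": 35, "TEST2": 30, "TEST3": 10, "TEST4": 44, "TEST5": 23}

print(important_tests(event_185))         # []
print(important_tests(event_142))         # ['TEST4']
print(si_counts([event_185, event_142]))  # Counter({'TEST4': 1})
```

Over those two events only TEST4 picks up an SI count, and only from
the event with the lesser weights.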

This leads us to avgSD.

Once we start down the road of looking for this sensitivity we find we
want more of it. Consider the following:

TEST1 = 30
TEST2 = 25
TEST3 = 8
TEST4 = 40
TEST5 = 20

TOTAL = 123

Now we can see that there are 3 "important" tests. TEST1, TEST2, and
TEST4 are all big enough to tip the scale. As a result, if any of
these tests "don't agree" then the message would be passed as
non-spam (in this event anyway).

Test dominance is a measure of how many other important tests are
present when a given test is "important". In the case with a total of
142, TEST4 was completely dominant so its test dominance number for
that event would be 1.0. Put in other words, the only thing that
"mattered" in this case was TEST4. The other tests were present, but
they would have had to "gang up" in order for them to effectively
disagree with TEST4. This can be a dangerous thing since it means that
TEST4 can buy itself a high accuracy score very easily.

In the event with the total weight 123, TEST1, TEST2, and TEST4 would
all receive a test dominance score of .333 --- that is to say, the
test is "one of three (1/3)" of the "important" tests during that
event. This is much better. Every one of these tests now has to work
hard in order to get a high accuracy score - and that's what we want.

avgSD is the average of the test dominance numbers that are given to a
particular test.
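
Here is a rough sketch of the arithmetic, again not the real MDLP
code. It assumes a threshold of 100, that a test's dominance number in
an event is 1 divided by the number of "important" tests in that event,
and that the average is taken only over events where the test was
"important". The function names are hypothetical.

```python
from collections import defaultdict

THRESHOLD = 100  # assumed from the examples in this message


def important_tests(weights, threshold=THRESHOLD):
    """Names of tests whose removal alone flips the spam/ham verdict."""
    total = sum(weights.values())
    is_spam = total >= threshold  # assumption: >= threshold means spam
    return [name for name, w in weights.items()
            if (total - w >= threshold) != is_spam]


def avg_sd(events, threshold=THRESHOLD):
    """Average test dominance per test over the events where it mattered."""
    scores = defaultdict(list)
    for weights in events:
        important = important_tests(weights, threshold)
        for name in important:
            # One of N important tests gets a dominance number of 1/N.
            scores[name].append(1.0 / len(important))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# The events with total weights 142 and 123 from the examples above.
event_142 = {"TEST1": 35, "TEST2": 30, "TEST3": 10, "TEST4": 44, "TEST5": 23}
event_123 = {"TEST1": 30, "TEST2": 25, "TEST3": 8, "TEST4": 40, "TEST5": 20}

for name, value in sorted(avg_sd([event_142, event_123]).items()):
    print(name, round(value, 3))
```

Under those assumptions TEST1 and TEST2 come out at about .333, and
TEST4 averages its 1.0 and its .333 to about .667.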

The AI is sensitive to this number so that the instinct of a test with
a very high avgSD is to reduce its weight so that it plays well with
the other tests.

There are many other "instincts" given to the AI creatures that live
on (represent) these tests also --- but this is how these particular
numbers get their meaning.

One way you can think of it is that these numbers (and many of the
others you see on the page) are how the creatures "feel/see" their
world.

Hope this helps,

_M




---
This E-mail came from the Declude.JunkMail mailing list.