http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
Summary: The ranking measure in hit-frequencies is suspicious
Product: Spamassassin
Version: unspecified
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P5
Component: Masses
AssignedTo: [email protected]
ReportedBy: [EMAIL PROTECTED]
Dear colleagues,
In the source code of masses/hit-frequencies, the ranking measure is based on
the following formula:
\[
IG(X,C) = \sum_{x \in \{0,1\}} \sum_{c \in \{C_h, C_s\}}
          P(X = x \wedge C = c) \cdot
          \log_2 \frac{P(X = x \wedge C = c)}{P(X = x) \cdot P(C = c)}
\]
This formula may be useful for a general categorization problem, but not for
SpamAssassin. The reasons are:
1. In a general categorization problem, we are interested in words that can
indicate either class, ham or spam. In SpamAssassin, however, we are only
interested in words that indicate the spam class. (I assume that ham rules are
not a good choice.)
2. In a general categorization problem, we are interested in both the presence
and the absence of a word in an e-mail. In SpamAssassin, however, we are only
interested in the presence of a word in an e-mail.
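To illustrate point 1, here is a minimal Python sketch of the IG formula computed from hit counts. It is not the code in masses/hit-frequencies; the function name `info_gain` and its parameters (`spam_hits`, `ham_hits`, `n_spam`, `n_ham`) are assumptions made for this example, and the counts are made up.

```python
from math import log2


def info_gain(spam_hits, ham_hits, n_spam, n_ham):
    """IG(X, C) from hit counts (hypothetical helper, non-empty corpus assumed).

    X = 1 means the rule hit the message; C is the class (spam or ham).
    """
    total = n_spam + n_ham
    # Joint counts for every (x, c) cell of the 2x2 contingency table.
    cells = {
        (1, "spam"): spam_hits,
        (1, "ham"): ham_hits,
        (0, "spam"): n_spam - spam_hits,
        (0, "ham"): n_ham - ham_hits,
    }
    hits = spam_hits + ham_hits
    p_x = {1: hits / total, 0: (total - hits) / total}
    p_c = {"spam": n_spam / total, "ham": n_ham / total}

    ig = 0.0
    for (x, c), count in cells.items():
        joint = count / total
        if joint > 0:  # an empty cell contributes nothing to the sum
            ig += joint * log2(joint / (p_x[x] * p_c[c]))
    return ig


# Made-up counts: one rule that mostly hits spam, one that mostly hits ham.
spam_rule = info_gain(spam_hits=80, ham_hits=5, n_spam=100, n_ham=100)
ham_rule = info_gain(spam_hits=5, ham_hits=80, n_spam=100, n_ham=100)
```

With equally sized spam and ham corpora, `spam_rule` and `ham_rule` agree up to rounding: the sum treats both classes, and both hit and miss, in the same way, so it cannot tell a spam indicator from a ham indicator.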
In my opinion, the measure should be:
\[
\mathrm{rank} =
\frac{P(X = 1 \wedge C = C_s) \cdot
      \log_2 \dfrac{P(X = 1 \wedge C = C_s)}{P(X = 1) \cdot P(C = C_s)}}
     {P(X = 1 \wedge C = C_h) \cdot
      \log_2 \dfrac{P(X = 1 \wedge C = C_h)}{P(X = 1) \cdot P(C = C_h)}}
\]
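The proposed measure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `term` and `proposed_rank` and the count-based parameters are hypothetical, and returning `None` when the ham term is zero is a choice made for this example, since the formula itself does not say what to do in that case.

```python
from math import log2
from typing import Optional


def term(joint: float, p_x: float, p_c: float) -> float:
    """One summand: P(x ^ c) * log2(P(x ^ c) / (P(x) * P(c))); 0 for an empty cell."""
    if joint == 0.0:
        return 0.0
    return joint * log2(joint / (p_x * p_c))


def proposed_rank(spam_hits: int, ham_hits: int,
                  n_spam: int, n_ham: int) -> Optional[float]:
    """Ratio of the spam term to the ham term, both taken at X = 1 only.

    Assumes a non-empty corpus. Returns None when the ham term is zero
    (no ham hits, or hits independent of the class), because the ratio
    is then undefined.
    """
    total = n_spam + n_ham
    p_hit = (spam_hits + ham_hits) / total  # P(X = 1)
    spam_term = term(spam_hits / total, p_hit, n_spam / total)
    ham_term = term(ham_hits / total, p_hit, n_ham / total)
    if ham_term == 0.0:
        return None
    return spam_term / ham_term


# Made-up counts: a rule hitting 80 of 100 spam and 5 of 100 ham messages.
rank = proposed_rank(spam_hits=80, ham_hits=5, n_spam=100, n_ham=100)
```

Two things are worth checking before relying on such a ratio. For a rule that hits ham less often than chance, the logarithm in the ham term is negative, so `rank` comes out negative in this example. A rule with no ham hits at all leaves the ratio undefined.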
Just a suggestion.
Best Regards,
Quang-Anh Tran