http://bugzilla.spamassassin.org/show_bug.cgi?id=4100

           Summary: The ranking measure in hit-frequencies is suspicious
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Masses
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


Dear colleagues,

In the source code of masses/hit-frequencies, the ranking measure is based on 
the following formula:

#             sum                                    P(X = x ^ C = c)
# IG(X,C) = x in [0, 1]    P(X = x ^ C = c) . log2( ------------------- )
#           c in [Ch, Cs]                           P(X = x) . P(C = c)

This formula may be useful for a general categorization problem, but not for 
SpamAssassin. The reasons are:

1. For a general categorization problem, we are interested in words that can 
predict either class, ham or spam. In SpamAssassin, however, we are only 
interested in words that can predict the spam class. (I assume that ham rules 
are not a good choice.)

2. For a general categorization problem, we are interested in the presence as 
well as the absence of a word in an e-mail. In SpamAssassin, however, we are 
only interested in the presence of a word in an e-mail.
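
To make the two points above concrete, here is a minimal Python sketch of the 
IG formula computed from hit counts. The function name, its parameters, and 
the corpus sizes are illustrative assumptions and are not taken from 
masses/hit-frequencies. It shows that a rule hitting only spam and a rule 
hitting only ham receive the same score, and that the X = 0 (absence) terms 
contribute to the sum.

```python
import math

def information_gain(spam_hits, ham_hits, n_spam, n_ham):
    """IG(X, C) summed over x in {0, 1} and c in {ham, spam}.

    Hypothetical helper: all argument names are assumptions for illustration.
    """
    total = n_spam + n_ham
    # Joint counts for (x, c): rule hit (1) or not hit (0), in spam or ham.
    joint = {
        (1, "spam"): spam_hits,
        (0, "spam"): n_spam - spam_hits,
        (1, "ham"): ham_hits,
        (0, "ham"): n_ham - ham_hits,
    }
    p_x = {
        1: (spam_hits + ham_hits) / total,
        0: (total - spam_hits - ham_hits) / total,
    }
    p_c = {"spam": n_spam / total, "ham": n_ham / total}

    ig = 0.0
    for (x, c), count in joint.items():
        p_xc = count / total
        if p_xc > 0:  # a zero joint probability contributes nothing
            ig += p_xc * math.log2(p_xc / (p_x[x] * p_c[c]))
    return ig

# Made-up corpus of 1000 spam and 1000 ham messages.
spam_rule = information_gain(spam_hits=300, ham_hits=0, n_spam=1000, n_ham=1000)
ham_rule = information_gain(spam_hits=0, ham_hits=300, n_spam=1000, n_ham=1000)
useless_rule = information_gain(spam_hits=100, ham_hits=100, n_spam=1000, n_ham=1000)
```

With equal class sizes, `spam_rule` and `ham_rule` come out equal up to 
floating-point rounding, so the measure by itself does not prefer rules that 
indicate spam.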

In my opinion, the measure should be:

$rank =

                            P(X = 1 ^ C = Cs)
(P(X = 1 ^ C = Cs) . log2( ---------------------- ))
                           P(X = 1) . P(C = Cs)
/
                            P(X = 1 ^ C = Ch)
(P(X = 1 ^ C = Ch) . log2( ---------------------- ))
                           P(X = 1) . P(C = Ch)
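
The suggested measure could be sketched in Python as follows. This is only an 
illustration under assumptions: the function name, its parameters, and the 
choice to return None when the ratio is undefined are hypothetical and are not 
part of hit-frequencies.

```python
import math

def spam_only_rank(spam_hits, ham_hits, n_spam, n_ham):
    """Ratio of the (X = 1, C = Cs) term to the (X = 1, C = Ch) term.

    Hypothetical helper: returns None when the ham term is zero, because
    the ratio is then undefined (an assumption of this sketch).
    """
    total = n_spam + n_ham
    p_x1 = (spam_hits + ham_hits) / total  # P(X = 1)

    def term(hits, n_class):
        # P(X = 1 ^ C = c) . log2(P(X = 1 ^ C = c) / (P(X = 1) . P(C = c)))
        p_joint = hits / total
        if p_joint == 0:
            return 0.0
        return p_joint * math.log2(p_joint / (p_x1 * (n_class / total)))

    spam_term = term(spam_hits, n_spam)
    ham_term = term(ham_hits, n_ham)
    if ham_term == 0:
        return None
    return spam_term / ham_term

# Made-up counts: the rule hits 300 of 1000 spam and 10 of 1000 ham messages.
rank = spam_only_rank(spam_hits=300, ham_hits=10, n_spam=1000, n_ham=1000)
no_ham_hits = spam_only_rank(spam_hits=300, ham_hits=0, n_spam=1000, n_ham=1000)
```

Two edge cases show up when the formula is evaluated this way. A rule that 
never hits ham makes the denominator zero, and for a rule that hits mostly 
spam the log2 in the ham term is negative, so the ratio itself is negative.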

Just a suggestion.

Best Regards,
Quang-Anh Tran


