[EMAIL PROTECTED] (Tolkin, Steve) wrote:
>* If there are only two categories (spam vs. non-spam)
>is there some special algorithm that is appropriate?

That's only one category, actually, and each message is either in the
category or not.  There may be special things to do when there's only
one category, but so far I don't know them.

>* Actually a high fraction of the spam messages are in Spanish.
>I could manually separate these out very quickly.
>Would it help improve performance (i.e. better F1 score)
>to have these in a separate category?

It's possible.  The best approach is to try it both ways and see which
scores better.  For that, you need a corpus big enough that you can
train on one portion and test on another.
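
As a rough sketch of that comparison (in Python rather than Perl, and
not tied to AI::Categorize), you could hold out part of the corpus and
compute F1 on it for each setup.  The function names and the
"spam_spanish"-style labels are made up for illustration; with the
split-out variant, the Spanish and other spam labels are collapsed back
to plain "spam" before scoring so that the two F1 numbers are comparable.

```python
import random


def split_corpus(messages, test_fraction=0.25, seed=0):
    """Shuffle and split into (train, test); seed fixed for repeatability."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def collapse(label):
    """Map hypothetical sub-labels such as 'spam_spanish' back to 'spam'."""
    return "spam" if label.startswith("spam") else label


def f1_score(actual, predicted, positive="spam"):
    """F1 for one category: harmonic mean of precision and recall."""
    pairs = [(collapse(a), collapse(p)) for a, p in zip(actual, predicted)]
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Train each variant on the first list returned by split_corpus, predict
labels for the second, and compare the two f1_score results.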

>* An easy way to detect these Spanish messages is to look for 
>the Perl pattern / esta/i
>But I am concerned that the strength of this predictor would 
>be "diluted" due to the many word forms.  

Not to mention the false positives, which you want to avoid in an
application like this.  
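
To illustrate the false-positive problem, here is a small sketch using
Python's re module with the equivalent of the Perl pattern above.  The
sample subject lines are invented for the example.

```python
import re

# Python equivalent of the Perl pattern / esta/i
pattern = re.compile(r" esta", re.IGNORECASE)

# Invented sample lines: one Spanish, two English that merely contain
# " esta" as the start of a longer word, and one unrelated line.
samples = [
    "Hola, esta oferta es para ti",
    "Cheap real estate listings",
    "We will establish a new office",
    "Meeting notes attached",
]

matches = [s for s in samples if pattern.search(s)]
for s in matches:
    print(s)
```

The pattern fires on "estate" and "establish" as well as on the Spanish
line, because it only requires a space before "esta" and nothing after.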

>* Probably the best way to detect spam is to look for a number
>in the subject line, e.g., FREE Life Insurance Quotes    10077

This might be an effective way to recognize spam, but it's different
from the two existing AI::Categorize:: classes in that it involves a
hand-written rule.
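
A hand-written rule of that kind might look like the following sketch
(Python again).  The exact pattern is an assumption based only on the
one example subject quoted above: whitespace followed by a run of three
or more digits at the very end of the subject.

```python
import re

# Assumed rule: subject ends with whitespace plus 3 or more digits.
TRAILING_NUMBER = re.compile(r"\s+\d{3,}\s*$")


def subject_looks_like_spam(subject):
    """Return True if the subject ends in a bare number."""
    return bool(TRAILING_NUMBER.search(subject))


print(subject_looks_like_spam("FREE Life Insurance Quotes    10077"))
print(subject_looks_like_spam("Lunch on Friday?"))
```

A rule like this will also flag legitimate subjects such as
"Meeting room 1204", so it would need checking against real mail.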

>* I would like to get a list of the words most likely to be associated
>with a category.  Can I get this from your code?  How?
>E.g. for the spam category I expect to find Britney, free, etc.
>This is very important because Outlook rules can move mail based
>on words.  I would be willing to move any email containing
>"Britney" to a spam_probably folder.

So far there are no hooks for that.  You could examine the
AI::Categorize::NaiveBayes data structures and find the largest log-prob
numbers (they're all negative, so the largest are the ones closest to
zero) for a crude measure.  It's probably also worth looking at the
cross-entropy, which is on the todo list but not implemented yet.
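
As a sketch of the crude measure, suppose you have pulled the per-word
log-probabilities for each category out into plain dictionaries.  That
layout is hypothetical, not the actual AI::Categorize::NaiveBayes
internals, and the numbers below are invented.  The second function is a
different, simple refinement (the difference of log-probs between two
categories), not the cross-entropy measure mentioned above.

```python
def top_words(log_probs, n=10):
    """Words with the largest (closest to zero) log-probability."""
    return sorted(log_probs, key=log_probs.get, reverse=True)[:n]


def distinctive_words(log_probs, other_log_probs, n=10):
    """Words whose log-prob here most exceeds their log-prob elsewhere."""
    shared = [w for w in log_probs if w in other_log_probs]
    shared.sort(key=lambda w: log_probs[w] - other_log_probs[w], reverse=True)
    return shared[:n]


# Invented example values; a hypothetical word -> log-prob mapping.
spam = {"free": -2.0, "britney": -3.0, "the": -1.0, "meeting": -8.0}
ham = {"free": -6.0, "britney": -9.0, "the": -1.0, "meeting": -3.0}

print(top_words(spam, 2))
print(distinctive_words(spam, ham, 2))
```

The plain ranking tends to surface words that are common everywhere
(such as "the"), which is why it is only a crude measure; comparing
against the other category pushes those words down.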

>* Ideally I could set up my mail system to make a call to
>some external program, and it would return a category.
>Is this possible to do in the Outlook client, 
>or in the Exchange server?

No idea, but let us know if you find out that it can be done.


  -------------------                            -------------------
  Ken Williams                             Last Bastion of Euclidity
  [EMAIL PROTECTED]                            The Math Forum
