RE: AI::Categorize slides from YAPC online

Tolkin, Steve Wed, 20 Jun 2001 07:39:59 -0700
Dear Ken et al.,
        I have been saving the spam email sent to me so that one day 
I could use a module like this to detect it.  
I could make available to anyone a *.csv (comma separated values) file
that contains about 440 spam messages.  It has a total of 
about 27000 lines and is 1.1 MB (400 KB compressed with zip).

This file would need to be complemented by a file of the same size,
or ideally even larger, that contains non-spam.  I could not
distribute a non-spam file, as most of my messages are company confidential.
Anyway each person would want to have their own non-spam file, 
as I get messages about databases, XML, etc. and other people 
would get other messages on other topics.

I want to pursue how to incorporate your code into an actual
solution.  I have some theoretical questions, and some practical
ones.

* If there are only two categories (spam vs. non-spam)
is there some special algorithm that is appropriate?

* Actually a high fraction of the spam messages are in Spanish.
I could manually separate these out very quickly.
Would it help improve performance (i.e. better F1 score)
to have these in a separate category?

* An easy way to detect these Spanish messages is to look for 
the Perl pattern / esta/i
But I am concerned that the strength of this predictor would
be "diluted" due to the many word forms.
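
One way to limit that dilution, sketched in Python for illustration rather than Perl, is to collapse a handful of related forms into a single pseudo-word before categorizing. The list of forms, the pseudo-word `_esta`, and the function name are assumptions made for this example; none of them come from AI::Categorize.

```python
import re

# Assumed list of related Spanish forms (esta, está, este, esto, estas,
# estos, están); extend or trim it to match the actual mail.
SPANISH_FORMS = re.compile(r"\best(?:a|á|e|o|as|os|án)\b", re.IGNORECASE)

def collapse_spanish_forms(text):
    """Replace each listed form with the single pseudo-word _esta."""
    return SPANISH_FORMS.sub("_esta", text)
```

Because the pattern requires word boundaries on both sides, English words such as "estate" are left alone.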

* Probably the best way to detect spam is to look for a number
in the subject line, e.g., FREE Life Insurance Quotes    10077
However I suspect your code would treat all these numbers as
different words and so not notice the pattern.  It seems desirable
to first transform the input in certain ways.  I might want to
transform strings such as 19\d\d and 20\d\d to e.g. the dummy word
_date and then transform all other numbers with 4 or more digits
to e.g. _number.  Then a very strong predictor of spam is
_number in the subject.

* Similarly I want to map punctuation to pseudo-words, e.g. any string 
of more than one consecutive ! character would become _bang.
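
The transformations described in the last two items could be sketched as a preprocessing pass like the one below (Python for illustration; the same regular expressions would work in Perl). The pseudo-words `_date`, `_number`, and `_bang` are the ones proposed above; the function name `normalize` and the whitespace tokenization are assumptions, and nothing here is part of AI::Categorize.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d\d\b")   # 19\d\d and 20\d\d
LONG_NUMBER = re.compile(r"\b\d{4,}\b")   # any other run of 4 or more digits
BANGS = re.compile(r"!{2,}")              # more than one consecutive "!"

def normalize(text):
    """Map years, long numbers, and runs of "!" to pseudo-words, then tokenize."""
    # Years go first so they are not swallowed by the _number rule.
    text = YEAR.sub(" _date ", text)
    text = LONG_NUMBER.sub(" _number ", text)
    text = BANGS.sub(" _bang ", text)
    return text.split()
```

Applied to the subject line quoted above, `normalize("FREE Life Insurance Quotes    10077")` yields the tokens FREE, Life, Insurance, Quotes, _number, so every such subject shares one feature instead of each carrying its own unique number.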

Some additional background: My company uses Microsoft
Outlook, which has a "rules wizard" with some limited ability
to route mail to different folders, based on who it is from, words
in the subject etc.  I am currently using Outlook 98, but plan
to go to Outlook 2000 soon.

* I would like to get a list of the words most likely to be associated
with a category.  Can I get this from your code?  How?
E.g. for the spam category I expect to find Britney, free, etc.
This is very important because Outlook rules can move mail based
on words.  I would be willing to move any email containing
"Britney" to a spam_probably folder.
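
Independently of whatever AI::Categorize exposes, such a word list can be approximated directly from the two message collections. The sketch below (Python, for illustration) ranks words by a smoothed log-odds of appearing in a spam message versus a non-spam message. The function name, the add-one smoothing, and the input format (one list of lowercased tokens per message) are all assumptions made for this example.

```python
import math
from collections import Counter

def top_words(spam_docs, ham_docs, n=10):
    """Return the n words most associated with spam.

    Each argument is a list of messages; each message is a list of tokens.
    """
    # Count the number of messages containing each word (document frequency).
    spam_counts = Counter(w for doc in spam_docs for w in set(doc))
    ham_counts = Counter(w for doc in ham_docs for w in set(doc))
    vocab = set(spam_counts) | set(ham_counts)

    def score(word):
        # Add-one smoothing keeps the ratio finite for words seen in one class only.
        p_spam = (spam_counts[word] + 1) / (len(spam_docs) + 2)
        p_ham = (ham_counts[word] + 1) / (len(ham_docs) + 2)
        return math.log(p_spam / p_ham)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(vocab, key=lambda w: (-score(w), w))[:n]
```

The words at the top of the result are candidates for word-based mail rules; words that also occur often in legitimate mail score near zero and drop down the list.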

* Does Outlook 2000 add much in the way of what the rules
pay attention to?

* Ideally I could set up my mail system to make a call to
some external program, and it would return a category.
Is this possible to do in the Outlook client, 
or in the Exchange server?

* Can Outlook filter messages based on the domain of the sender?
My first filter would accept anything sent from within the company,
or from certain known outside addresses.  Then I would
categorize all the remaining messages into spam vs non-spam.
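
Outside of Outlook, that first filter could be sketched as follows (Python, for illustration). It says nothing about what Outlook rules can do; the domain and address sets are hypothetical placeholders, and `is_trusted` is a name invented for this example.

```python
from email.utils import parseaddr

TRUSTED_DOMAINS = {"example.com"}            # hypothetical: the company domain
TRUSTED_ADDRESSES = {"friend@example.org"}   # hypothetical: known outside senders

def is_trusted(from_header):
    """Return True if the From header names a trusted address or domain."""
    _, address = parseaddr(from_header)      # drops the display name
    address = address.lower()
    domain = address.rpartition("@")[2]
    return address in TRUSTED_ADDRESSES or domain in TRUSTED_DOMAINS
```

Only messages for which `is_trusted` returns False would then be passed on to the spam vs. non-spam categorizer. Since a From header can be forged, this is a convenience filter rather than a security check.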
 
Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          [EMAIL PROTECTED]      617-563-0516 
Fidelity Investments   82 Devonshire St. V10D    Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.

> -----Original Message-----
> From: Ken Williams [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, June 19, 2001 5:14 PM
> To: [EMAIL PROTECTED]
> Subject: AI::Categorize slides from YAPC online
> 
> 
> Hi perl-ai list,
> 
> The slides from my YAPC talk on AI::Categorize are online now, at:
> 
>   http://mathforum.com/~ken/categorize/
> 
> Please take a look if you're interested.  The same slides will be
> available on www.yapc.org when Kevin has time to put them there.
> 
> Several people at the talk expressed interest in helping with
> development of AI::Categorize:: modules.  Here are my thoughts:
> 
>   * If you want to implement a new algorithm (besides NaiveBayes and kNN,
>   which I've already done), just go ahead and do it and release to CPAN.
>   You don't need to discuss it with me unless you want to.  The modules
>   should be in the AI::Categorize:: namespace, and subclasses of
>   AI::Categorize.
>   
>   * Discussions & announcements should take place on this list, so that
>   people with more knowledge than me can chime in.  If the traffic gets
>   too much, we can split off to a new list.  But at least for a while, it
>   would be nice to get some meat into the perl-ai list archives. =)  Let's
>   post often, as I'm sure there's a lot of knowledge people have to share,
>   as well as a lot of people who'd like to listen.
>   
>   * If anyone has additions/changes/fixes to the existing modules, don't
>   hold them back.  For example, there was discussion of adding stuff to
>   reduce the feature sets (number of words considered important) by
>   looking at their cross-entropy, and I'd like to get that in there.
>   
> As I mentioned at the talk, the main reason I created this namespace and
> released the initial stuff was to jumpstart community efforts in this
> area.  It seemed strange that there wasn't anything on CPAN to do this
> kind of NLP stuff, when Perl seems so well-known in the NLP community.
> So I hope there will be interest from people on this list (and that the
> interested people from YAPC are indeed subscribed!).
> 
> 
>   -------------------                            -------------------
>   Ken Williams                             Last Bastion of Euclidity
>   [EMAIL PROTECTED]                            The Math Forum
>