Dear Ken et al.,
I have been saving the spam email sent to me so that one day
I could use a module like this to detect it.
I could make available to anyone a *.csv (comma separated values) file
that contains about 440 spam messages. It has a total of
about 27000 lines and is 1.1 MB (400 KB compressed with zip).
This file would need to be complemented by a file of the same size,
or ideally even larger, that contains non-spam. I could not
distribute a non-spam file, as most messages are company confidential.
Anyway, each person would want to have their own non-spam file,
as I get messages about databases, XML, etc. and other people
would get other messages on other topics.
I want to pursue how to incorporate your code into an actual
solution. I have some theoretical questions, and some practical
ones.
* If there are only two categories (spam vs. non-spam),
is there some special algorithm that is appropriate?
* Actually a high fraction of the spam messages are in Spanish.
I could manually separate these out very quickly.
Would it help improve performance (i.e. better F1 score)
to have these in a separate category?
* An easy way to detect these Spanish messages is to look for
the Perl pattern / esta/i
But I am concerned that the strength of this predictor would
be "diluted" due to the many word forms.
* Probably the best way to detect spam is to look for a number
in the subject line, e.g., FREE Life Insurance Quotes 10077
However I suspect your code would treat all these numbers as
different words and so not notice the pattern. It seems desirable
to first transform the input in certain ways. I might want to
transform strings such as 19\d\d and 20\d\d to e.g. the dummy word
_date and then transform all other numbers with 4 or more digits
to e.g. _number. Then a very strong predictor of spam is
_number in the subject.
* Similarly I want to map punctuation to pseudo-words, e.g. any string
of more than one consecutive ! character would become _bang.
Some additional background: My company uses Microsoft
Outlook, which has a "rules wizard" with some limited ability
to route mail to different folders, based on who it is from, words
in the subject etc. I am currently using Outlook 98, but plan
to go to Outlook 2000 soon.
* I would like to get a list of the words most likely to be associated
with a category. Can I get this from your code? How?
E.g. for the spam category I expect to find Britney, free, etc.
This is very important because Outlook rules can move mail based
on words. I would be willing to move any email containing
"Britney" to a spam_probably folder.
* Does Outlook 2000 add much in the way of what the rules
pay attention to?
* Ideally I could set up my mail system to make a call to
some external program, and it would return a category.
Is this possible to do in the Outlook client,
or in the Exchange server?
* Can Outlook filter messages based on the domain of the sender?
My first filter would accept anything sent from within the company,
or from certain known outside addresses. Then I would
categorize all the remaining messages into spam vs non-spam.
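As an illustration of the kind of word list asked about in the first point above, here is a minimal sketch that ranks words by a smoothed log-odds ratio between the two categories. It is in Python for illustration only and does not use AI::Categorize; whether that module exposes such a list is still an open question. The function names tokenize and top_spam_words, the add-one smoothing, and the choice of log-odds as the score are all assumptions of mine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and '_')."""
    return re.findall(r"[a-z_]+", text.lower())


def top_spam_words(spam_docs, ham_docs, n=10):
    """Return the n words most strongly associated with spam.

    Each word is scored by log(P(word | spam) / P(word | non-spam)),
    where the probabilities are add-one smoothed document frequencies.
    """
    # Count each word at most once per message (document frequency).
    spam_df = Counter(w for d in spam_docs for w in set(tokenize(d)))
    ham_df = Counter(w for d in ham_docs for w in set(tokenize(d)))
    n_spam, n_ham = len(spam_docs), len(ham_docs)

    def score(word):
        p_spam = (spam_df[word] + 1) / (n_spam + 2)
        p_ham = (ham_df[word] + 1) / (n_ham + 2)
        return math.log(p_spam / p_ham)

    vocab = set(spam_df) | set(ham_df)
    # Highest score first; ties are broken alphabetically.
    return sorted(vocab, key=lambda w: (-score(w), w))[:n]


if __name__ == "__main__":
    spam = ["FREE Britney pictures", "Britney free offer", "free quotes"]
    ham = ["database meeting notes", "XML schema free draft",
           "database XML review"]
    print(top_spam_words(spam, ham, n=5))
```

The top few words from such a list could be typed by hand into Outlook rules. With a real corpus it would probably be wise to ignore words that occur in only a handful of messages, since their scores are unreliable.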
Hopefully helpfully yours,
Steve
--
Steven Tolkin [EMAIL PROTECTED] 617-563-0516
Fidelity Investments 82 Devonshire St. V10D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.
> -----Original Message-----
> From: Ken Williams [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, June 19, 2001 5:14 PM
> To: [EMAIL PROTECTED]
> Subject: AI::Categorize slides from YAPC online
>
>
> Hi perl-ai list,
>
> The slides from my YAPC talk on AI::Categorize are online now, at:
>
> http://mathforum.com/~ken/categorize/
>
> Please take a look if you're interested. The same slides will be
> available on www.yapc.org when Kevin has time to put them there.
>
> Several people at the talk expressed interest in helping with
> development of AI::Categorize:: modules. Here are my thoughts:
>
> * If you want to implement a new algorithm (besides NaiveBayes and kNN,
> which I've already done), just go ahead and do it and release to CPAN.
> You don't need to discuss it with me unless you want to. The modules
> should be in the AI::Categorize:: namespace, and subclasses of
> AI::Categorize.
>
> * Discussions & announcements should take place on this list, so that
> people with more knowledge than me can chime in. If the traffic gets
> too much, we can split off to a new list. But at least for a while, it
> would be nice to get some meat into the perl-ai list archives. =) Let's
> post often, as I'm sure there's a lot of knowledge people have to share,
> as well as a lot of people who'd like to listen.
>
> * If anyone has additions/changes/fixes to the existing modules, don't
> hold them back. For example, there was discussion of adding stuff to
> reduce the feature sets (number of words considered important) by
> looking at their cross-entropy, and I'd like to get that in there.
>
> As I mentioned at the talk, the main reason I created this namespace and
> released the initial stuff was to jumpstart community efforts in this
> area. It seemed strange that there wasn't anything on CPAN to do this
> kind of NLP stuff, when Perl seems so well-known in the NLP community.
> So I hope there will be interest from people on this list (and that the
> interested people from YAPC are indeed subscribed!).
>
>
> ------------------- -------------------
> Ken Williams Last Bastion of Euclidity
> [EMAIL PROTECTED] The Math Forum
>