Wayne Wilson wrote:
> Annotated terms are one thing, and so far need a human
> editor to keep them up to date.
> 
> Word lists are another thing.  These can be automated.  You
> can enumerate them from source and do a concordance.
> 
> But I think there is another interesting approach.
> 
> I am sure most of us deal with e-mail 'spam'.   Rule-based
> mail filters only go so far to control it.  However, much
> empirical evidence and practical experience exists which
> suggests that 'self-learning' systems based on statistical
> semantics or Bayesian filters achieve better performance
> than rules in a very short time.  (see
> http://www.paulgraham.com/spam.html "A Plan for Spam")

David Mertz has written a nice review of approaches to
the spam problem, including Bayesian and other ML
(machine learning) techniques - see
http://www-106.ibm.com/developerworks/linux/library/l-spamf.html

> 
> Apple has included the semantic approach in Jaguar (OS X.2)
> native mail app.  All the unix mail gurus at umich have
> switched to it!
> The Bayesian filter is available in open source and will be
> included in a future release of mozilla.  I have seen these
> systems work, they are amazing.  The mail community of the
> Internet have basically given up on controlling spam at the
> protocol level, and here comes some simple and elegant
> client software that does the job!

These filters can be used centrally on listservers etc, but
someone needs to train the system, and to update the training
data on an ongoing basis. Most sysadmins are already
overburdened, but it seems likely that list managers such as
MailMan, which could allow others to administer such training
via Web interfaces, will incorporate ML filtering in future
versions.

> 
> Using something like this to pipe all your 'medical records
> of interest' through, will soon learn your particular
> vocabulary.  Who cares if it includes non-medical terms.
> (you can train it to ignore those anyway!)  Make it a small
> group system and it learns your group's vocabulary.

The Bayesian filters are classifiers - they decide whether
a particular message is spam, is non-spam, or cannot be
confidently assigned to either class.
That is different to the "information extraction" problem,
which is concerned with ferreting out particular types of
information from a body of text. Statistical machine learning
techniques are also used for information extraction, but
slightly different ones from those used for classification.
For an example of open source machine learning information
extraction, see 
http://datamining.anu.edu.au/projects/linkage.html#prototype_software

:-)
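
To make the classification side concrete, here is a minimal sketch
of a naive Bayes filter in Python with the three-way decision
described above. It is not the algorithm from "A Plan for Spam" or
from any of the mail clients mentioned; the class name TinyBayes,
the tokeniser, the Laplace smoothing and the 0.2/0.8 thresholds are
all assumptions made for illustration, and the training messages
are invented.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes filter (illustrative only)."""

    def __init__(self):
        # Number of training messages per class that contain each word.
        self.words = {"spam": Counter(), "ham": Counter()}
        # Number of training messages seen per class.
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Assumed tokeniser: the set of lower-cased alphanumeric words.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.docs[label] += 1
        self.words[label].update(self.tokens(text))

    def spam_probability(self, text):
        # Work in log-odds; add-one smoothing keeps unseen words neutral.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in self.tokens(text):
            p_spam = (self.words["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, low=0.2, high=0.8):
        # Thresholds are arbitrary; anything in between is left undecided.
        p = self.spam_probability(text)
        if p >= high:
            return "spam"
        if p <= low:
            return "non-spam"
        return "unsure"


# Invented training data, for illustration only.
f = TinyBayes()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches now", "spam")
f.train("patient admitted with chest pain", "ham")
f.train("review of patient discharge summary", "ham")

print(f.classify("buy cheap pills"))
print(f.classify("patient chest pain"))
print(f.classify("meeting tomorrow"))
```

Note that the output is only a label for the whole message. Nothing
in it picks out which words are medical terms, which is why
extraction needs different machinery.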

But don't ask me how you would build a system to extract 
previously unencountered medical terms from a body of text
- that seems like a very challenging problem. 

Tim C

> 
> Such a system is incrementally updated and kept current by
> the only people who care,  the users of it.
