Re: [OT] "Boosting" and other potential research topics

Henry Stern Tue, 17 May 2005 17:34:05 -0700

I've only just noticed this thread now.  Sorry for the delay in response.

--

Re: Boosting.

I'm really not a fan of ensemble learning algorithms such as boosting
and bagging.  IMO, they are a hack used to prop up unstable learning
algorithms such as ID3 and C5.0.

What would be far more useful is an implementation of a support vector
machine (SVM) that supports our security constraints (p(spam|m) <
p(spam|m + one more not-nice rule)), i.e. hitting one more not-nice
rule must never make a message look less spammy.  More in the next
section.
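
For a linear scorer, the constraint above holds whenever every not-nice
(spam) rule carries a strictly positive weight.  A minimal Python
sketch of that idea follows.  The rule names, the min_weight floor and
the function names are all hypothetical; this is not SpamAssassin's
actual scoring code.

```python
def project_weights(weights, spam_rules, min_weight=0.01):
    """Clamp every spam-rule weight to at least min_weight.

    Weights of other ("nice") rules are left untouched, so they may
    still be negative.
    """
    return {
        rule: max(w, min_weight) if rule in spam_rules else w
        for rule, w in weights.items()
    }


def score(weights, hits):
    """Linear score of a message: the sum of the weights of the rules it hit."""
    return sum(weights[rule] for rule in hits)


# Hypothetical learned weights; SPAM_B came out negative and gets clamped.
learned = {"SPAM_A": 1.5, "SPAM_B": -0.3, "NICE_C": -1.0}
safe = project_weights(learned, spam_rules={"SPAM_A", "SPAM_B"})

# Hitting one more spam rule now strictly raises the score.
assert score(safe, {"SPAM_A"}) < score(safe, {"SPAM_A", "SPAM_B"})
```

Clamping after training is the crudest way to get there; a learner that
enforces the constraint during optimisation would be preferable.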

--

Re: Open research areas

1.  Online self-supervised learning with SVM.

SpamAssassin's classifier should be able to adjust its weights over time
based on its input.  This would allow it to adapt to a user's (or
site's) mail in the same way that the personalised Bayes classifier
does.  Since we don't want to ship 500MB of data with every SpamAssassin
release, or have each site process a rapidly growing corpus, this should
be made to work correctly while retaining only a minimal subset of the
training data.
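
One way to picture this is a linear SVM trained by stochastic
subgradient steps on the hinge loss, which labels incoming mail with
its own confident predictions and keeps only a small buffer of
low-margin messages.  This is a sketch under those assumptions, not the
approach from the research mentioned below; the class name, parameters
and retention policy are hypothetical.

```python
class OnlineLinearSVM:
    """Linear SVM updated one message at a time (hinge-loss subgradient)."""

    def __init__(self, n_features, lam=0.01, lr=0.1, max_retained=100):
        self.w = [0.0] * n_features
        self.lam = lam                  # L2 regularisation strength
        self.lr = lr                    # step size
        self.max_retained = max_retained
        self.retained = []              # (margin when seen, x, y) tuples

    def decision(self, x):
        """Raw score; positive means spam, negative means ham."""
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def self_label(self, x, min_confidence=1.0):
        """Return +1/-1 if the current model is confident, else None."""
        s = self.decision(x)
        if abs(s) < min_confidence:
            return None
        return 1 if s > 0 else -1

    def update(self, x, y):
        """One subgradient step on a message x with label y in {+1, -1}."""
        margin = y * self.decision(x)
        # Shrink the weights (regularisation term).
        self.w = [wi * (1.0 - self.lr * self.lam) for wi in self.w]
        # Move towards the example only if it violates the margin.
        if margin < 1.0:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
        # Keep only the lowest-margin messages seen so far.
        self.retained.append((margin, x, y))
        self.retained.sort(key=lambda item: item[0])
        del self.retained[self.max_retained:]


clf = OnlineLinearSVM(n_features=2, max_retained=4)
for _ in range(50):
    clf.update([1.0, 0.0], 1)     # feature 0 fires on spam
    clf.update([0.0, 1.0], -1)    # feature 1 fires on ham
label = clf.self_label([1.0, 0.0], min_confidence=0.5)
```

In a self-supervised loop, update() would only be called when
self_label() returns a label, so uncertain messages never feed back
into the weights.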

This is an extension of my aborted doctoral research.

2.  Rule selection.

With so many features in spam, it is difficult to find the optimal set
of rules to ship with SpamAssassin.  Scanning time increases linearly
with the number of rules (see #3!), so every rule that is added slows
down the engine.  As Justin mentioned, there is often a lot of overlap
between rules.  I've been working on a genetic algorithm for selecting
which rules to include but my work thus far is limited in scope.  Please
see trunk/masses/evolve_metarule for what I've done so far.  I'd like to
expand this in scope so that it can be effectively used with the full
ruleset.
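
For readers unfamiliar with the idea, a genetic algorithm for rule
selection can be sketched roughly as follows.  This is an illustrative
toy in Python, not the code in trunk/masses/evolve_metarule: the
fitness function ("any selected rule hit means spam", minus a per-rule
cost), the parameters and the data are all assumptions.

```python
import random


def fitness(mask, hits, labels, rule_cost=0.01):
    """Accuracy of 'any selected rule hit => spam', minus a cost per rule."""
    correct = 0
    for row, is_spam in zip(hits, labels):
        predicted = any(h and m for h, m in zip(row, mask))
        correct += (predicted == is_spam)
    return correct / len(labels) - rule_cost * sum(mask)


def evolve(hits, labels, generations=50, pop_size=20, mutation=0.1, seed=0):
    """Search for a subset of rules (a boolean mask); needs >= 2 rules."""
    rng = random.Random(seed)
    n = len(hits[0])
    # Start from the full ruleset plus random subsets.
    pop = [[True] * n]
    pop += [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, hits, labels), reverse=True)
        elite = pop[: pop_size // 2]          # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)         # single-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation else g for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(m, hits, labels))


# Toy corpus: rules 0 and 1 overlap completely, rule 2 hits ham.
hits = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
labels = [True, True, True, False, False]
best = evolve(hits, labels)
```

Because the best half of each generation is carried over unchanged, the
result is never worse than shipping the full ruleset under this fitness
function.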

3.  Faster body scanning.

Every body rule in SpamAssassin requires a separate pass through the
message, so the time complexity of scanning is O(n) in the number of
rules n.  If all of the rules are combined into one using a trie
structure, the rules can be evaluated in O(log n) time.  However, with
the latter method it still requires a (less expensive) O(n) operation
to find which rules have been satisfied.  A valuable area of research
that would really help SpamAssassin would be to develop a method of
evaluating all of the regular expressions in O(log n) time and finding
exactly which rules have been satisfied in O(log n) time.
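
As a point of reference for the trie idea, the Aho-Corasick automaton
is a well-known trie-based matcher that finds every occurrence of a set
of literal strings in a single pass over the text and reports which
ones matched.  The Python sketch below handles literal phrases only,
not full regular expressions, and makes no claim to the O(log n) bound
discussed above; the rule names and phrases are hypothetical.

```python
from collections import deque


def build_automaton(rules):
    """Build an Aho-Corasick automaton from {rule_name: literal_phrase}."""
    goto, fail, out = [{}], [0], [set()]
    # Phase 1: insert every phrase into a trie.
    for name, phrase in rules.items():
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(name)
    # Phase 2: breadth-first pass to set failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit rules that end at the longest proper suffix.
            out[t] |= out[fail[t]]
    return goto, fail, out


def scan(text, automaton):
    """Single pass over text; returns the set of rule names that matched."""
    goto, fail, out = automaton
    state, matched = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        matched |= out[state]
    return matched


rules = {"FREE_MONEY": "free money", "MONEY": "money", "LOTTERY": "lottery"}
automaton = build_automaton(rules)
matched = scan("get free money now", automaton)
```

The scan visits each character of the message once regardless of how
many rules there are, and the matched set falls out of the same pass.
Extending this to real regular expressions is where the open problem
lies.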

There are many more open research areas related to SpamAssassin, so
don't limit your scope to what I've suggested here.  If anyone else has
ideas, please chime in.

Cheers,
Henry

Ted Markowitz wrote:

In this same vein of exploring concepts like the application of boosting
algorithms or using meta rulesets to enhance the SA classification
process, I've been looking for an interesting doctoral dissertation
topic in the spam domain for some time now and was wondering if folks
in the SA community had some ideas rolling around in the back of their
minds that would lend themselves to doctoral-level research? Perhaps
some area you'd really like to explore yourself, "if only you had the
time." :-)

My program in CS is especially geared towards folks with a lot of
hands-on, real world IT experience, and so topics with an applied
research & development bent and a serious coding component are quite OK.
Any ideas, interesting leads, or useful pointers would be much appreciated.

Thanks muchly for your thoughts.

--ted

Sidney Markowitz wrote:

Fred wrote:

There was similar work being done in the past to identify rules to be
grouped into new meta rules, this (w|c)ould achieve similar results.
http://bugzilla.spamassassin.org/show_bug.cgi?id=1363


I think I'm missing something here. Are you saying that automatically
grouping rules into meta rules that have similar classification properties
is equivalent to boosting? Or do you mean that it is another approach that
also can improve performance of weak learners?

In any case, you have given me an idea for the microarray gene expression
problem, so thanks! :-)

-- sidney

--

================================================================
Ted Markowitz
Chief Architect
Cognosys LLC (http://www.cognosys.net)
10 Hamilton Lane, Darien, CT 06820-2809, USA
----------------------------------------------------------------
203-655-2400 (phone/fax)                     203-984-6565 (cell)
[EMAIL PROTECTED] (email)                    TJMarkowitz (AIM ID)
================================================================
 NOTICE: This e-mail, including attachments, is intended solely
 for the person(s) or organization(s) shown in the message's
 header and may contain confidential and/or legally privileged
 information.  Any unauthorized disclosure, copying, or other
 unapproved use or retransmission of this information may be
 unlawful and is strictly prohibited.  If you are not the
 intended recipient, please delete this message immediately.
================================================================
