I've only just noticed this thread now. Sorry for the delay in response.
--
Re: Boosting.
I'm really not a fan of ensemble learning algorithms such as boosting and bagging. IMO, they are a hack used to prop up unstable learning algorithms such as ID3 and C5.0.
What would be far more useful is an implementation of support vector machine (SVM) that supports our security constraints (p(spam|m) < p(spam|m + one more not-nice rule)). More in the next section.
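As a rough sketch of what that constraint means for a linear classifier: if every non-nice rule carries a strictly positive weight, one more non-nice hit can only raise the spam probability. The function names, rule names, and weights below are hypothetical, and a logistic function of the summed weights stands in for p(spam|m). This is not SpamAssassin's scoring code.

```python
import math

def spam_probability(hits, weights, bias=0.0):
    """Logistic squashing of a linear score over the rules that hit."""
    score = bias + sum(weights[rule] for rule in hits)
    return 1.0 / (1.0 + math.exp(-score))

def satisfies_constraint(weights, nice_rules):
    """True if every non-nice rule has a strictly positive weight.

    For a linear score this is sufficient for
    p(spam|m) < p(spam|m + one more non-nice rule).
    """
    return all(w > 0 for rule, w in weights.items() if rule not in nice_rules)

# Hypothetical rule names and weights, for illustration only.
weights = {
    "HYPO_SPAMMY_SUBJECT": 1.2,
    "HYPO_BAD_RELAY": 0.7,
    "HYPO_NICE_WHITELIST": -2.0,
}
nice_rules = {"HYPO_NICE_WHITELIST"}

before = spam_probability({"HYPO_SPAMMY_SUBJECT"}, weights)
after = spam_probability({"HYPO_SPAMMY_SUBJECT", "HYPO_BAD_RELAY"}, weights)
```

A trainer that respects the constraint would have to keep the non-nice weights positive during optimisation; the check above only verifies the result after the fact.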
--
Re: Open research areas
1. Online self-supervised learning with SVM.
SpamAssassin's classifier should be able to adjust its weights over time based on its input. This would allow it to adapt to a user's (or site's) mail in the same way that the personalised Bayes classifier does. Since we don't want to ship 500MB of data with every SpamAssassin install, and each site processes a rapidly growing corpus, this should be made to work correctly while retaining only a minimal subset of the training data.
This is an extension of my aborted doctoral research.
2. Rule selection.
With so many features in spam, it is difficult to find the optimal set of rules to ship with SpamAssassin. Scanning time increases linearly with the number of rules (see #3!), so every rule that is added slows down the engine. As Justin mentioned, there is often a lot of overlap between rules. I've been working on a genetic algorithm for selecting which rules to include but my work thus far is limited in scope. Please see trunk/masses/evolve_metarule for what I've done so far. I'd like to expand this in scope so that it can be effectively used with the full ruleset.
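To make the idea concrete, here is a minimal sketch of a genetic algorithm over rule subsets. It is not what trunk/masses/evolve_metarule does; the toy corpus, the "spam if any selected rule hits" classifier, the per-rule cost, and all names are assumptions made purely for illustration.

```python
import random

# Hypothetical toy corpus: (set of rule indices that hit, is_spam).
CORPUS = [
    ({0, 1}, True), ({1, 2}, True), ({0, 2}, True), ({2, 3}, True),
    ({3}, False), (set(), False), ({4}, False), ({3, 4}, False),
]
NUM_RULES = 5
RULE_COST = 0.02  # assumed per-rule penalty standing in for scan time

def fitness(mask):
    """Accuracy of 'spam if any selected rule hits', minus a per-rule cost."""
    selected = {i for i, bit in enumerate(mask) if bit}
    correct = sum(bool(hits & selected) == is_spam for hits, is_spam in CORPUS)
    return correct / len(CORPUS) - RULE_COST * len(selected)

def evolve(generations=40, pop_size=20, mutation_rate=0.1, seed=0):
    """Evolve a bit mask over the rules; True means the rule is shipped."""
    rng = random.Random(seed)
    pop = [tuple(rng.random() < 0.5 for _ in range(NUM_RULES))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]         # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, NUM_RULES)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

The cost term is what pushes the search away from overlapping rules: two rules that catch the same messages add cost without adding accuracy, so one of them tends to be dropped.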
3. Faster body scanning.
Every body rule in SpamAssassin requires a separate pass through the message, so the time complexity of scanning is O(n), where n is the number of rules. If all of the rules are combined into one using a trie structure, the rules can be evaluated in O(log n) time. However, the latter method still requires a (less expensive) O(n) operation to find which rules have been satisfied. A valuable area of research that would really help SpamAssassin would be to develop a method of evaluating all of the regular expressions in O(log n) time and finding exactly which rules have been satisfied in O(log n) time.
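For the combined-trie idea, a minimal sketch follows. It uses the Aho-Corasick construction (a trie with failure links), which is my choice of technique and only handles literal strings, not full regular expressions, so it illustrates the single-pass idea rather than solving the research problem. The rule names and patterns are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links over literal patterns.

    patterns maps a (hypothetical) rule name to a literal string.
    """
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> longest proper suffix that is a trie prefix
    out = [set()]  # out[state] -> rule names matched on reaching state
    for name, text in patterns.items():
        state = 0
        for ch in text:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(name)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:                     # breadth-first, so shallower states are done first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches that end at a suffix
            queue.append(nxt)
    return goto, fail, out

def scan(body, automaton):
    """Single pass over body; returns the names of all rules that matched."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for ch in body:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits

# Hypothetical literal body rules, for illustration only.
rules = {
    "HYPO_CHEAP_PILLS": "cheap pills",
    "HYPO_PILLS": "pills",
    "HYPO_FREE_MONEY": "free money",
}
automaton = build_automaton(rules)
hits = scan("buy cheap pills now", automaton)
```

Here the set of satisfied rules is collected during the single pass over the body, so no separate sweep over the rule list is needed afterwards. Extending this to arbitrary regular expressions is the hard part and is left open.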
There are many more open research areas related to SpamAssassin, so don't limit your scope to what I've suggested here. If anyone else has ideas, please chime in.
Cheers, Henry
Ted Markowitz wrote:
In this same vein of exploring concepts like the application of boosting algorithms or using meta rulesets to enhance the SA classification process, I've been looking for an interesting doctoral dissertation topic in the spam domain for some time now and was wondering if folks in the SA community had some ideas rolling around in the back of their minds that would lend themselves to doctoral-level research. Perhaps some area you'd really like to explore yourself, "if only you had the time." :-)
My program in CS is especially geared towards folks with a lot of hands-on, real world IT experience, and so topics with an applied research & development bent and a serious coding component are quite OK. Any ideas, interesting leads, or useful pointers would be much appreciated.
Thanks muchly for your thoughts.
--ted
Sidney Markowitz wrote:
Fred wrote:
There was similar work being done in the past to identify rules to be grouped into new meta rules; this (w|c)ould achieve similar results. http://bugzilla.spamassassin.org/show_bug.cgi?id=1363
I think I'm missing something here. Are you saying that automatically grouping rules into meta rules that have similar classification properties is equivalent to boosting? Or do you mean that it is another approach that also can improve performance of weak learners?
In any case, you have given me an idea for the microarray gene expression problem, so thanks! :-)
-- sidney
--
================================================================
Ted Markowitz
Chief Architect
Cognosys LLC (http://www.cognosys.net)
10 Hamilton Lane, Darien, CT 06820-2809, USA
----------------------------------------------------------------
203-655-2400 (phone/fax)
203-984-6565 (cell)
[EMAIL PROTECTED] (email)
TJMarkowitz (AIM ID)
================================================================
NOTICE: This e-mail, including attachments, is intended solely for the person(s) or organization(s) shown in the message's header and may contain confidential and/or legally privileged information. Any unauthorized disclosure, copying, or other unapproved use or retransmission of this information may be unlawful and is strictly prohibited. If you are not the intended recipient, please delete this message immediately.
================================================================
