Thanks for the thoughts, Chris. I've been thinking along some of these same lines myself, wondering exactly how much more effective N-gram phrases (for some arbitrary N) would be vis-a-vis the Bayesian classifier. I seem to remember some research along these lines presented at the 2005 MIT Spam Conference by Jonathan Zdziarski of DSPAM, where he talks about "Bayesian Noise Reduction" using N-gram-sized phrases as "meta-tokens" that can then be fed into some spam/ham classifier.
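
To make the N-gram idea concrete, here is a minimal sketch of turning a message into word N-gram tokens that could be handed to a spam/ham classifier. It is only an illustration of the general idea, not DSPAM's Bayesian Noise Reduction algorithm; the function name, the whitespace tokenization and the underscore-joined token format are all assumptions made for the example.

```python
from typing import List


def ngram_tokens(text: str, n: int = 2) -> List[str]:
    """Return the word n-grams of `text` as single tokens.

    Hypothetical helper: words are lowercased, split on whitespace,
    and each run of n consecutive words is joined with '_'.
    """
    words = text.lower().split()
    # A message shorter than n words yields no n-grams at all.
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    # Each bigram could be treated as one token by a classifier.
    print(ngram_tokens("Buy cheap meds online now", n=2))
```

Larger N gives more specific phrases but far more distinct tokens, so how N trades accuracy against database size is exactly the open question.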

Cheers,

--ted

Chris Santerre wrote:
Well... this kind of goes along with the idea of "bayes chains". You can look into which pairs/trios of bayes tokens hit the most spam and the least ham. The same goes for rules. There are some scripts around the community that list the top-hitting rules, which might come in very useful.
 
Once you find these magic pairs/trios, it should be relatively easy to meta them together. Although I'm not sure how you would do that on the bayes token side, as I think it is kind of already handled. It's public knowledge that I dislike bayes and don't use it :)
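
As a rough sketch of the pair-hunting step described above: given, for each message, the set of rules that fired, count how often every pair of rules fires together in spam and in ham, then list the pairs with the fewest ham hits and the most spam hits first. The rule names, the input format and the ranking order are assumptions for illustration, not the output of any existing community script.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def rank_rule_pairs(
    spam_hits: Iterable[Set[str]],
    ham_hits: Iterable[Set[str]],
) -> List[Tuple[Pair, int, int]]:
    """Return (pair, spam_count, ham_count), best candidates first."""
    spam_counts: Counter = Counter()
    ham_counts: Counter = Counter()
    for hits in spam_hits:
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        spam_counts.update(combinations(sorted(hits), 2))
    for hits in ham_hits:
        ham_counts.update(combinations(sorted(hits), 2))
    scored = [(pair, n, ham_counts[pair]) for pair, n in spam_counts.items()]
    # Fewest ham hits first, then most spam hits, then by name for stability.
    scored.sort(key=lambda t: (t[2], -t[1], t[0]))
    return scored


if __name__ == "__main__":
    # Hypothetical rule names standing in for real rule hits.
    spam = [{"RULE_A", "RULE_B"}, {"RULE_A", "RULE_B", "RULE_C"}, {"RULE_B", "RULE_C"}]
    ham = [{"RULE_B", "RULE_C"}, {"RULE_A"}]
    for pair, n_spam, n_ham in rank_rule_pairs(spam, ham):
        print(pair, n_spam, n_ham)
```

A pair that comes out on top could then be written by hand as a SpamAssassin meta rule joining the two rule names with `&&`. Extending the sketch to trios only means changing the 2 passed to `combinations` to 3.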
 
It's a good idea, and probably the next best step to look at.
 
HTH,
 

Chris Santerre
System Admin and SARE/URIBL Ninja
http://www.rulesemporium.com
http://www.uribl.com

-----Original Message-----
From: Ted Markowitz [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 04, 2005 6:55 PM
To: [email protected]
Subject: [OT] "Boosting" and other potential research topics

In this same vein of exploring concepts like the application of boosting algorithms or the use of meta rulesets to enhance the SA classification process: I've been looking for an interesting doctoral dissertation topic in the spam domain for some time now, and was wondering whether folks in the SA community have ideas rolling around in the back of their minds that would lend themselves to doctoral-level research. Perhaps some area you'd really like to explore yourself, "if only you had the time." :-)

My program in CS is especially geared towards folks with a lot of hands-on, real world IT experience, and so topics with an applied research & development bent and a serious coding component are quite OK. Any ideas, interesting leads, or useful pointers would be much appreciated.

Thanks muchly for your thoughts.

--ted

Sidney Markowitz wrote:
Fred wrote:
  
There was similar work done in the past to identify rules to be
grouped into new meta rules; this (w|c)ould achieve similar results.
http://bugzilla.spamassassin.org/show_bug.cgi?id=1363
    

I think I'm missing something here. Are you saying that automatically
grouping rules into meta rules that have similar classification properties
is equivalent to boosting? Or do you mean that it is another approach that
also can improve performance of weak learners?

In any case, you have given me an idea for the microarray gene expression
problem, so thanks! :-)

 -- sidney
  

-- 

================================================================
Ted Markowitz
Chief Architect
Cognosys LLC (http://www.cognosys.net)
10 Hamilton Lane, Darien, CT 06820-2809, USA
----------------------------------------------------------------
203-655-2400 (phone/fax)                     203-984-6565 (cell)
[EMAIL PROTECTED] (email)                    TJMarkowitz (AIM ID)
================================================================
 NOTICE: This e-mail, including attachments, is intended solely
 for the person(s) or organization(s) shown in the message's
 header and may contain confidential and/or legally privileged
 information.  Any unauthorized disclosure, copying, or other
 unapproved use or retransmission of this information may be
 unlawful and is strictly prohibited.  If you are not the
 intended recipient, please delete this message immediately.
================================================================

    


