-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

there's been quite a bit of research into N-gram bayesian phrases;
I'd recommend reading the spambayes list archives, the bogofilter
archives, and I think Gordon Cormack covered its accuracy too.

summary: you'll massively expand database size for not a huge
gain, iirc ;)
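
To make the database-size point concrete, here is a rough sketch of how the number of distinct tokens grows once word N-grams are stored alongside single words. It is not taken from spambayes, bogofilter, or SpamAssassin; the function names (ngram_tokens, distinct_token_count) and the sample messages are made up for illustration, and real tokenizers are considerably more involved.

```python
def ngram_tokens(words, max_n):
    """Return every word n-gram of length 1..max_n as a space-joined token."""
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def distinct_token_count(messages, max_n):
    """Count the distinct tokens a database would hold for these messages."""
    seen = set()
    for msg in messages:
        seen.update(ngram_tokens(msg.lower().split(), max_n))
    return len(seen)


# Hypothetical sample corpus, purely for illustration.
messages = [
    "buy cheap meds online now",
    "cheap meds shipped online today",
    "meeting moved to friday afternoon",
    "can we move the meeting to monday",
]

# Print the distinct-token count for unigrams only, then up to bigrams,
# then up to trigrams.
for max_n in (1, 2, 3):
    print(max_n, distinct_token_count(messages, max_n))
```

Each extra N adds nearly one new token per word position, and longer phrases repeat far less often than single words, so on real mail the token database keeps growing with N. Whether the extra tokens buy any accuracy is a separate question that only a proper corpus test can answer.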

Dobly noise reduction is slightly different though.

- --j.

Ted Markowitz writes:
> Thanks for the thoughts, Chris. I've been thinking along some of these 
> same lines myself as to exactly how much more effective N-gram phrases 
> (with some arbitrary N) would be vis a vis the Bayesian classifier. I 
> seem to remember some research along these lines described at the 2005 
> MIT Spam Conference by Jonathan Zdziarski of DSPAM where he talks about 
> "Bayesian Noise Reduction" using N-gram sized phrases as "meta-tokens" 
> which can then be fed into some spam/ham classifier.
> 
> Cheers,
> 
> --ted
> 
> Chris Santerre wrote:
> 
> > Well....this kind of goes along the idea of "bayes chains". You can 
> > look into which pairs/trios of bayes tokens hit the most spam and 
> > least ham. Same goes for rules. There are some scripts around the 
> > community to give top hitting rules, which might come in very useful.
> >  
> > Once you find these magic pairs/trios, it should be relatively easy to 
> > meta them together. Although I'm not sure how you would do that on the 
> > bayes token side, as I think it kind of already is handled. It's public 
> > knowledge that I dislike bayes and don't use it :)
> >  
> > It's a good idea, and probably the next best step to look at.
> >  
> > HTH,
> >  
> >
> > Chris Santerre
> > System Admin and SARE/URIBL Ninja
> > http://www.rulesemporium.com <http://www.rulesemporium.com/>
> > http://www.uribl.com <http://www.uribl.com/>
> >
> >     -----Original Message-----
> >     *From:* Ted Markowitz [mailto:[EMAIL PROTECTED]
> >     *Sent:* Wednesday, May 04, 2005 6:55 PM
> >     *To:* [email protected]
> >     *Subject:* [OT] "Boosting" and other potential research topics
> >
> >     In this same vein of exploring concepts like the application of
> >     boosting algorithms or using meta rulesets to enhance the SA
> >     classification process, I've been looking for an interesting
> >     doctoral dissertation topic in the spam domain for some time now
> >     and was wondering if folks in the SA community had some ideas
> >     rolling around in the back of their minds that would lend
> >     themselves to doctoral-level research? Perhaps some area you'd
> >     really like to explore yourself, "if only you had the time.":-)
> >
> >     My program in CS is especially geared towards folks with a lot of
> >     hands-on, real world IT experience, and so topics with an applied
> >     research & development bent and a serious coding component are
> >     quite OK. Any ideas, interesting leads, or useful pointers would
> >     be much appreciated.
> >
> >     Thanks muchly for your thoughts.
> >
> >     --ted
> >
> >     Sidney Markowitz wrote:
> >
> >>Fred wrote:
> >>  
> >>
> >>>There was similar work being done in the past to identify rules to be
> >>>grouped into new meta rules, this (w|c)ould achieve similar results.
> >>>http://bugzilla.spamassassin.org/show_bug.cgi?id63
> >>>    
> >>>
> >>
> >>I think I'm missing something here. Are you saying that automatically
> >>grouping rules into meta rules that have similar classification properties
> >>is equivalent to boosting? Or do you mean that it is another approach that
> >>also can improve performance of weak learners?
> >>
> >>In any case, you have given me an idea for the microarray gene expression
> >>problem, so thanks! :-)
> >>
> >> -- sidney
> >>  
> >>
> >
> >-- 
> >
> >===============================================================
> >Ted Markowitz
> >Chief Architect
> >Cognosys LLC (http://www.cognosys.net)
> >10 Hamilton Lane, Darien, CT 06820-2809, USA
> >----------------------------------------------------------------
> >203-655-2400 (phone/fax)                     203-984-6565 (cell)
> >[EMAIL PROTECTED] (email)                    TJMarkowitz (AIM ID)
> >===============================================================
> > NOTICE: This e-mail, including attachments, is intended solely
> > for the person(s) or organization(s) shown in the message's
> > header and may contain confidential and/or legally privileged
> > information.  Any unauthorized disclosure, copying, or other
> > unapproved use or retransmission of this information may be
> > unlawful and is strictly prohibited.  If you are not the
> > intended recipient, please delete this message immediately.
> >===============================================================
> >    
> >
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCelV3MJF5cimLx9ARAmumAKCuT1EnKrDlYlZKLx3J+2YKoo+83gCfc+wb
ZthhE6q23GrXnfRFDFr0KBc=
=h+Tr
-----END PGP SIGNATURE-----
