-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

there's been quite a bit of research into N-gram bayesian phrases; I'd
recommend reading the spambayes list archives, the bogofilter archives,
and I think Gordon Cormack covered its accuracy too. summary: you'll
massively expand database size for not a huge gain, iirc ;)

Dobly noise reduction is slightly different though.

- --j.

Ted Markowitz writes:
> Thanks for the thoughts, Chris. I've been thinking along some of these
> same lines myself as to exactly how much more effective N-gram phrases
> (with some arbitrary N) would be vis a vis the Bayesian classifier. I
> seem to remember some research along these lines described at the 2005
> MIT Spam Conference by Jonathan Zdziarski of DSPAM where he talks about
> "Bayesian Noise Reduction" using N-gram sized phrases as "meta-tokens"
> which can then be fed into some spam/ham classifier.
>
> Cheers,
>
> --ted
>
> Chris Santerre wrote:
>
> > Well....this kind of goes along the idea of "bayes chains". You can
> > look into which pairs/treos of bayes tokens hit the most spam and
> > least ham. Same goes for rules. There are some scripts around the
> > community to give top hitting rules, which might come in very useful.
> >
> > Once you find these magic pairs/treos, it should be relatively easy to
> > meta them together. Although I'm not sure how you would do that on the
> > bayes token side, as I think it kind of already is handled. It's public
> > knowledge that I dislike bayes and don't use it :)
> >
> > It's a good idea, and probably the next best step to look at.
> >
> > HTH,
> >
> >
> > Chris Santerre
> > System Admin and SARE/URIBL Ninja
> > http://www.rulesemporium.com <http://www.rulesemporium.com/>
> > http://www.uribl.com <http://www.uribl.com/>
> >
> > -----Original Message-----
> > *From:* Ted Markowitz [mailto:[EMAIL PROTECTED]
> > *Sent:* Wednesday, May 04, 2005 6:55 PM
> > *To:* [email protected]
> > *Subject:* [OT] "Boosting" and other potential research topics
> >
> > In this same vein of exploring concepts like the application of
> > boosting algorithms or using meta rulesets to enhance the SA
> > classification process, I've been looking for an interesting
> > doctoral dissertation topic in the spam domain for some time now
> > and was wondering if folks in the SA community had some ideas
> > rolling around in the back of their minds that would lend
> > themselves to doctoral-level research? Perhaps some area you'd
> > really like to explore yourself, "if only you had the time." :-)
> >
> > My program in CS is especially geared towards folks with a lot of
> > hands-on, real world IT experience, and so topics with an applied
> > research & development bent and a serious coding component are
> > quite OK. Any ideas, interesting leads, or useful pointers would
> > be much appreciated.
> >
> > Thanks muchly for your thoughts.
> >
> > --ted
> >
> > Sidney Markowitz wrote:
> >
> >>Fred wrote:
> >>
> >>
> >>>There was similar work being done in the past to identify rules to be
> >>>grouped into new meta rules, this (w|c)ould achieve similar results.
> >>>http://bugzilla.spamassassin.org/show_bug.cgi?id63
> >>>
> >>>
> >>
> >>I think I'm missing something here. Are you saying that automatically
> >>grouping rules into meta rules that have similar classification properties
> >>is equivalent to boosting? Or do you mean that it is another approach that
> >>also can improve performance of weak learners?
> >>
> >>In any case, you have given me an idea for the microarray gene expression
> >>problem, so thanks! :-)
> >>
> >> -- sidney
> >>
> >>
> >
> >--
> >
> >===============================================================
> >Ted Markowitz
> >Chief Architect
> >Cognosys LLC (http://www.cognosys.net)
> >10 Hamilton Lane, Darien, CT 06820-2809, USA
> >----------------------------------------------------------------
> >203-655-2400 (phone/fax) 203-984-6565 (cell)
> >[EMAIL PROTECTED] (email) TJMarkowitz (AIM ID)
> >===============================================================
> > NOTICE:
> >This e-mail, including attachments, is intended solely
> > for the person(s) or organization(s) shown in the message's
> > header and may contain confidential and/or legally privileged
> > information. Any unauthorized disclosure, copying, or other
> > unapproved use or retransmission of this information may be
> > unlawful and is strictly prohibited. If you are not the
> > intended recipient, please delete this message immediately.
> >===============================================================
> >
> >

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCelV3MJF5cimLx9ARAmumAKCuT1EnKrDlYlZKLx3J+2YKoo+83gCfc+wb
ZthhE6q23GrXnfRFDFr0KBc=
=h+Tr
-----END PGP SIGNATURE-----
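
To illustrate the database-size point made at the top of the thread, here is a minimal sketch of how adding word N-grams on top of single-word tokens inflates the number of distinct tokens a Bayes store would need to track. It assumes plain whitespace tokenization and a made-up three-message corpus; the names ngram_tokens and distinct_token_count are hypothetical and are not taken from SpamAssassin, spambayes, bogofilter, or DSPAM, whose tokenizers differ in detail.

```python
def ngram_tokens(words, max_n):
    """Return every contiguous word n-gram for n = 1..max_n.

    Each n-gram is joined with single spaces so it can serve as one
    "meta-token" key in a per-token database.
    """
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def distinct_token_count(messages, max_n):
    """Count the distinct tokens produced across all messages."""
    seen = set()
    for text in messages:
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        seen.update(ngram_tokens(text.lower().split(), max_n))
    return len(seen)


# Hypothetical toy corpus, for illustration only.
MESSAGES = [
    "buy cheap pills online now",
    "cheap pills shipped to you now",
    "meeting notes attached for review",
]

if __name__ == "__main__":
    for max_n in (1, 2, 3):
        print("max N =", max_n, "distinct tokens =", distinct_token_count(MESSAGES, max_n))
```

On real mail the growth would likely be much steeper than on a toy corpus like this, since most longer phrases occur only once; whether the extra tokens buy any accuracy is the open question the thread raises, and this sketch does not measure that.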
