Thanks for the thoughts, Chris. I've been thinking along some of
these same lines myself, specifically about how much more effective N-gram
phrases (for some arbitrary N) would be vis-a-vis the Bayesian
classifier. I seem to remember some research along these lines
described at the 2005 MIT Spam Conference by Jonathan Zdziarski of
DSPAM, where he talks about "Bayesian Noise Reduction" using N-gram
sized phrases as "meta-tokens" which can then be fed into some spam/ham
classifier.
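
To make the N-gram "meta-token" idea concrete, here is a minimal sketch of
just the phrase-generation step. It is not DSPAM's Bayesian Noise Reduction
algorithm, and the function name, the '+' joiner, and the sample text are
all assumptions made up for illustration:

```python
from typing import List


def ngram_tokens(words: List[str], max_n: int = 3) -> List[str]:
    """Return every phrase of 1..max_n adjacent words, joined with '+'.

    Hypothetical helper: each phrase would be treated as one "meta-token"
    and handed to whatever spam/ham classifier is in use.
    """
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of width n across the word list.
        for i in range(len(words) - n + 1):
            tokens.append("+".join(words[i:i + n]))
    return tokens


# Made-up sample text, split on whitespace only.
sample = "buy cheap meds now".split()
phrases = ngram_tokens(sample, max_n=2)
print(phrases)
```

With max_n=2 this yields the four single words plus the three adjacent
pairs. The number of distinct tokens grows quickly with N, which is one
reason the choice of N is worth studying rather than assuming.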
Cheers,
--ted
Chris Santerre wrote:
Well... this kind of goes along with the idea of "bayes
chains". You can look into which pairs/trios of bayes tokens hit the
most spam and least ham. Same goes for rules. There are some scripts
around the community to give top hitting rules, which might come in
very useful.
Once you find these magic pairs/trios, it should be
relatively easy to meta them together. Although I'm not sure how you
would do that on the bayes token side, as I think it kind of already is
handled. It's public knowledge that I dislike bayes and don't use it :)
It's a good idea, and probably the next best step to look at.
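
As a rough sketch of the pair-hunting step for rules: count how often each
pair of rules fires together in spam and in ham, then list the pairs with
the fewest ham hits first. The rule names, the tiny corpora, and the
min_spam cutoff below are invented for illustration; this is not one of
the community scripts mentioned above:

```python
from collections import Counter
from itertools import combinations


def pair_counts(messages):
    """Count the messages in which each unordered pair of rules co-fires."""
    counts = Counter()
    for hits in messages:
        # Sorting gives every pair one canonical (a, b) ordering.
        for pair in combinations(sorted(set(hits)), 2):
            counts[pair] += 1
    return counts


def rank_pairs(spam, ham, min_spam=2):
    """Return (pair, spam_hits, ham_hits), fewest ham hits first."""
    spam_c, ham_c = pair_counts(spam), pair_counts(ham)
    ranked = [
        (pair, s, ham_c.get(pair, 0))
        for pair, s in spam_c.items()
        if s >= min_spam  # ignore pairs that are too rare to trust
    ]
    # Fewest ham hits, then most spam hits, then name for a stable order.
    ranked.sort(key=lambda r: (r[2], -r[1], r[0]))
    return ranked


# Hypothetical per-message rule hits (placeholder rule names).
spam = [{"RULE_A", "RULE_B", "RULE_C"}, {"RULE_A", "RULE_B"}, {"RULE_B", "RULE_C"}]
ham = [{"RULE_B", "RULE_C"}, {"RULE_A"}]

best = rank_pairs(spam, ham)
for pair, s, h in best:
    print(pair, "spam:", s, "ham:", h)
```

Pairs near the top of the list are the candidates for a meta rule. Trios
work the same way with combinations(..., 3), at a much higher counting
cost.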
HTH,
In this same vein of exploring concepts like
the application of boosting algorithms or using meta rulesets to
enhance the SA classification process, I've been looking for an
interesting doctoral dissertation topic in the spam domain for some
time now and was wondering if folks in the SA community had some ideas
rolling around in the back of their minds that would lend themselves to
doctoral-level research. Perhaps some area you'd really like to explore
yourself, "if only you had the time." :-)
My program in CS is especially geared towards folks with a lot of
hands-on, real world IT experience, and so topics with an applied
research & development bent and a serious coding component are
quite OK. Any ideas, interesting leads, or useful pointers would be
much appreciated.
Thanks muchly for your thoughts.
--ted
Sidney Markowitz wrote:
Fred wrote:
There was similar work being done in the past to identify rules to be
grouped into new meta rules; this (w|c)ould achieve similar results.
http://bugzilla.spamassassin.org/show_bug.cgi?id=1363
I think I'm missing something here. Are you saying that automatically
grouping rules into meta rules that have similar classification properties
is equivalent to boosting? Or do you mean that it is another approach that
also can improve performance of weak learners?
In any case, you have given me an idea for the microarray gene expression
problem, so thanks! :-)
-- sidney
--
================================================================
Ted Markowitz
Chief Architect
Cognosys LLC (http://www.cognosys.net)
10 Hamilton Lane, Darien, CT 06820-2809, USA
----------------------------------------------------------------
203-655-2400 (phone/fax) 203-984-6565 (cell)
[EMAIL PROTECTED] (email) TJMarkowitz (AIM ID)
================================================================
NOTICE: This e-mail, including attachments, is intended solely
for the person(s) or organization(s) shown in the message's
header and may contain confidential and/or legally privileged
information. Any unauthorized disclosure, copying, or other
unapproved use or retransmission of this information may be
unlawful and is strictly prohibited. If you are not the
intended recipient, please delete this message immediately.
================================================================