http://bugzilla.spamassassin.org/show_bug.cgi?id=3139
------- Additional Comments From [EMAIL PROTECTED] 2005-06-05 08:56 -------
Subject: Re: RFE: ignore invisible text during rendering
In the past this 'Bayes poisoning' has actually proven to be a very good
indicator of spam, in general. The sort of thing most spammers put in their
mail as poisoning simply doesn't match the words commonly in use by the
recipients, and thus nicely classifies the mail as spam.
There has been moderate discussion of eliminating invisible text in various
contexts over the past few months; the idea isn't new and there is an open
bug (maybe this one) on the concept. Part of the question, once one gets
past the 'should we do it at all?' question, becomes 'where should we do it,
and under what circumstances?'
There are several possibilities:
a) eliminate invisible or near-invisible text entirely. *
b) eliminate it in bayes
c) eliminate it as a rule source
d) make visible and invisible rules, and have the invisible text only
available to special rules
e) make visible rules and full rules, with the invisible text left in
position in the full rules**
* Deciding what is invisible isn't as simple as checking for 0pt or
display:none. Text below a certain point size, or at almost any size with
the right combination of foreground and background colors (not necessarily
identical ones), can be invisible. To an extent this is a question of human
physiology. Spammers can choose colors, font faces and sizes by
experimentation; determining algorithmically what they have achieved that
way may be nontrivial.
(Of course, one can always start with the trivial cases, since they will
handle most of today's spam.)
** "Full rules" in this context does not refer to the current full rules;
it refers to something closer to the current body rules, which see rendered
text, but with the visible, near-invisible and invisible text all left in
position.
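
To make the "trivial cases" from the first footnote concrete, here is a minimal Python sketch, not SpamAssassin code. It flags text as probably invisible when it is hidden outright, very small, or drawn in a color too close to its background. The function names, the 2pt size cutoff and the 1.2 contrast cutoff are all assumptions chosen for illustration; the contrast measure borrowed here is the WCAG relative-luminance ratio, which is only one of several ways to approximate what a human can read.

```python
import re

# Both thresholds are illustrative assumptions, not measured values.
MIN_FONT_PT = 2.0    # anything smaller is treated as unreadable
MIN_CONTRAST = 1.2   # WCAG-style ratio; 1.0 means identical colors


def parse_hex_color(value):
    """Return (r, g, b) for '#rgb' or '#rrggbb', or None if unparseable."""
    m = re.fullmatch(r"#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})", value.strip())
    if not m:
        return None
    digits = m.group(1)
    if len(digits) == 3:
        digits = "".join(c * 2 for c in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color, in the range 0..1."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Contrast ratio between two (r, g, b) colors, from 1.0 up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    return (max(l1, l2) + 0.05) / (min(l1, l2) + 0.05)


def looks_invisible(font_pt, fg, bg, display="inline"):
    """Heuristic: True if the text is hidden, tiny, or nearly background-colored."""
    if display == "none" or font_pt < MIN_FONT_PT:
        return True
    fg_rgb, bg_rgb = parse_hex_color(fg), parse_hex_color(bg)
    if fg_rgb is None or bg_rgb is None:
        return False  # unknown colors: do not guess
    return contrast_ratio(fg_rgb, bg_rgb) < MIN_CONTRAST
```

With these assumed cutoffs, `looks_invisible(12, "#fffffe", "#ffffff")` is true even though the two colors differ, which is the "not necessarily the same" case. Font faces, background images and layered elements are not considered at all, so this covers only the easy end of the problem.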
As a rule writer rather than an SA implementor, I personally favor the
following, at least with current spam techniques:
1) leave the invisible text in bayes. I think someone did a test that
indicated that it was a better spam sign if left as currently rendered than
if stripped out into separate tokens from the visible text. However, this
experiment might be revisited to determine the best hit ratio.
2) Provide the following rule base types. (Some of these already exist;
this also covers some other complaints I have as a rule writer.)
a) pristine email message
b) headers
c) mime headers
d) decoded body sections (un-base64, etc)
e) rendered body sections, keeping invisible text
f) rendered body sections, deleting invisible text
g) anchor rules. These have two parts - the uri and the anchor
h) uri rules
Obviously types a, b, e, and h already exist in usable form. Type d exists,
but is largely unusable as it breaks the text by line rather than as a
section. (For that matter, type e has problems in some cases, as it breaks
by paragraph and other random places.)
So the new types would be a mime header type, the visible-rendered type, and
the anchor-rule type.
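
As a rough illustration of what the first four views (a through d) contain, here is a small Python sketch using the standard `email` package. It is not how SpamAssassin builds its rule inputs; `message_views`, the dictionary keys and `SAMPLE` are hypothetical names made up for this example. The rendered views (e, f) and the anchor and uri views (g, h) would need an HTML renderer and are left out.

```python
import base64
from email import message_from_string


def message_views(raw):
    """Split a raw message into views a-d (hypothetical helper)."""
    msg = message_from_string(raw)
    views = {
        "pristine": raw,                # a) the message exactly as received
        "headers": list(msg.items()),   # b) top-level headers
        "mime_headers": [],             # c) headers of each MIME leaf part
        "decoded_parts": [],            # d) each part decoded, kept as one unit
    }
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body text of their own
        views["mime_headers"].append(list(part.items()))
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        # Falling back to latin-1 is an assumption made for this sketch.
        charset = part.get_content_charset() or "latin-1"
        views["decoded_parts"].append(payload.decode(charset, errors="replace"))
    return views


# A tiny multipart message with one base64-encoded text part.
SAMPLE = "\n".join([
    "From: sender@example.com",
    "Subject: test",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XX"',
    "",
    "--XX",
    "Content-Type: text/plain; charset=us-ascii",
    "Content-Transfer-Encoding: base64",
    "",
    base64.b64encode(b"cheap meds here\n").decode("ascii"),
    "--XX--",
    "",
])

views = message_views(SAMPLE)
```

Here each entry in `views["decoded_parts"]` is a whole section rather than a list of lines, which is the property asked for in type d.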
Once the rule base types existed, there would be a formidable effort of
taking the existing body rules and determining which ones should be
rendered-body rules and which ones should be rendered-visible rules to get
the best results. This is a sub-project that SARE members would probably
happily take on to relieve the devs of having to do all of the work. For
that matter, if rawbody were fixed to return body parts rather than lines, a
lot of existing rules that currently exist as both rawbody/full or full/uri
or the like could be reduced to simple rawbody rules with improved hit
rates.
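
A small Python sketch of why line-at-a-time matching forces the duplicate rules mentioned above. It only models the behavior described in this comment (a pattern tested against one line at a time versus against a whole decoded part); the pattern, the sample text and the function names are invented for the example.

```python
import re

# Hypothetical rule pattern; \s+ lets it span a line break if it ever sees one.
PATTERN = re.compile(r"click\s+here\s+to\s+unsubscribe", re.IGNORECASE)

# One decoded body part in which the phrase wraps across two lines.
PART = "To stop these mails, click here\nto unsubscribe from our list.\n"


def hits_by_line(text, pattern):
    """Test the pattern against each line separately."""
    return any(pattern.search(line) for line in text.splitlines())


def hits_by_section(text, pattern):
    """Test the pattern against the whole part as a single string."""
    return pattern.search(text) is not None


line_hit = hits_by_line(PART, PATTERN)        # False: phrase is split in two
section_hit = hits_by_section(PART, PATTERN)  # True: \s+ matches the newline
```

The same pattern misses when fed lines and hits when fed the section, so a rule writer working per line has to fall back on a second rule against the full message to catch the wrapped case.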
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.