http://bugzilla.spamassassin.org/show_bug.cgi?id=3661





------- Additional Comments From [EMAIL PROTECTED]  2005-03-13 23:39 -------
Subject: Re:  Request for HTML de-obfuscation of invisible SPAN's

> > I can almost see having two body types: body, which is using the current
> > rendering, and cleanbody, which uses the cleaned up rendering.
>
> > I would then expect the development of new, possibly simpler rules, that
> > worked off the unobfuscated cleanbody terms.
>
> Well, sort of.  I would actually do the reverse of what you describe.

I had been thinking in terms of a new rule type for the cleaned body, so as
to not break the current rules.  However, in light of there seemingly not
being that many rules affected, I suppose it doesn't matter much.  (I'm a
little queasy, though, about possibly lots of 3rd party rules suddenly
'breaking'.)

> So I would add either a new rule type (not thrilled
> about that), or perhaps a tflag which specifies what type of text the
> body rule is supposed to get.

In passing, I don't understand the general reluctance to add new rule types.
To me a new type seems far cleaner and more obvious than crufting things up
with overloaded meanings by using cryptic flags that people will forget how
to spell.

Did SA at some point in the past have dozens of rule types and go through a
cleanup phase?  Or was there some other bad experience in the past with too
many rule types?  (My experience with SA only goes back to 2.6, so I don't
know ancient history.)

From where I sit looking at 2.6/3.0, I'd personally vote in favor of
doubling or tripling the number of available rule types before I even
thought about being concerned at the number of different types.  (But then,
I'd also be inclined to code PMS to not generate most of the rule type
sources unless it was known that there was at least one test on that rule
type.  This seems pretty trivial to do the way the main rules evaluation
works, last time I looked.)
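To make the idea concrete, here is a rough sketch in Python rather than SA's
Perl.  Everything in it is made up for illustration (the rule tuples, the
builder table, the function name); it is not how PMS is actually coded, just
the shape of "only build a rendering if some rule of that type exists".

```python
from typing import Callable, Dict, List, Tuple

# (rule name, rule type, regex): a hypothetical representation, not SA's own.
Rule = Tuple[str, str, str]


def render_needed_sources(
    message: str,
    builders: Dict[str, Callable[[str], str]],
    rules: List[Rule],
) -> Dict[str, str]:
    """Build only the renderings that at least one rule will read."""
    # Collect the rule types that are actually in use.
    needed = {rule_type for _name, rule_type, _pattern in rules}
    # Run a builder only for those types; unknown types are ignored here.
    return {t: builders[t](message) for t in needed if t in builders}


# Hypothetical builders; "cleanbody" stands in for an expensive rendering.
builders = {
    "body": lambda m: m.lower(),
    "cleanbody": lambda m: m.lower().replace("<span>", ""),
}
rules = [("EXAMPLE_RULE", "body", r"hello")]
sources = render_needed_sources("Hello <span>there", builders, rules)
# With no rule of type "cleanbody", that rendering is never generated.
```

With something like this, adding more rule types costs nothing at scan time
for sites that have no rules of those types.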


> That's not very efficient anyway.  It's already flagged internally what is
> visible and what isn't, we would just need to export a general "there was
> invisible text" flag.

I think we are both saying the same thing, I just didn't use the right
words.
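
What I take that to mean, sketched in Python with invented names (the
Segment class and has_invisible_text are my assumptions, not anything in
the SA HTML parser): the parser already tags each run of text as visible or
not, so the exported flag is just a test over those tags.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """One run of rendered text plus the parser's visibility decision."""
    text: str
    visible: bool


def has_invisible_text(segments: Iterable[Segment]) -> bool:
    """True if any non-empty run of text was marked invisible."""
    return any((not s.visible) and bool(s.text.strip()) for s in segments)
```

A rule could then check that single flag instead of re-scanning the HTML.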


> However, the rule is horrible as a spam detector:
>
>   4.187   4.0333   5.2863    0.433   0.00    0.01  T_HTML_INVIS_TEXT

That amazes me.  I wonder what kind of things are invisible in that ham
mail?  Are these newsletter-type things, or Word HTML output?  Or is there
just normally some hidden text in almost all HTML?

Maybe there are 2-3 really common hidden things in HTML, and after excepting
them, the results would improve?
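
As a sketch of what I mean by excepting the common cases (Python, and the
patterns below are pure guesses for illustration; the real list would have
to come from looking at what is actually hidden in the ham corpus):

```python
import re
from typing import List, Pattern

# Hypothetical examples of benign hidden text, not measured from any corpus.
BENIGN_HIDDEN: List[Pattern[str]] = [
    re.compile(r"^\s*$"),          # nothing but whitespace
    re.compile(r"^(?:&nbsp;)+$"),  # runs of non-breaking-space entities
]


def suspicious_hidden(hidden_chunks: List[str],
                      benign: List[Pattern[str]] = BENIGN_HIDDEN) -> List[str]:
    """Return the hidden chunks that match none of the benign patterns."""
    return [c for c in hidden_chunks
            if not any(p.search(c) for p in benign)]
```

The invisible-text rule would then fire only when this list is non-empty,
which might cut the ham hits if the guess about a few common cases is right.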





------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
