http://bugzilla.spamassassin.org/show_bug.cgi?id=3661

------- Additional Comments From [EMAIL PROTECTED]  2005-03-13 20:31 -------
Subject: Re:  Request for HTML de-obfuscation of invisible SPAN's

On Sun, Mar 13, 2005 at 06:49:07PM -0800, [EMAIL PROTECTED] wrote:
> I can almost see having two body types: body, which is using the current
> rendering, and cleanbody, which uses the cleaned up rendering.

> I would then expect the development of new, possibly simpler rules, that
> worked off the unobfuscated cleanbody terms.

Well, sort of.  I would actually do the reverse of what you describe.
Most of the rules already don't deal with obfuscation.  They're just
simple phrases.  So I would add either a new rule type (not thrilled
about that), or perhaps a tflag which specifies what type of text the
body rule is supposed to get.  A lot of rules (e.g. "act now!") match
text that is never going to be hidden, but some things can go either way.
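
A minimal sketch of the tflag idea, in Python rather than SpamAssassin's
own code.  Everything here is hypothetical: the `BodyRule` class, the
`fulltext` flag name, and the assumption that rules get the visible-only
text unless they ask for the full rendering.

```python
import re
from dataclasses import dataclass, field


@dataclass
class BodyRule:
    """Hypothetical body rule: a name, a regex, and a set of tflags."""
    name: str
    pattern: str
    tflags: set = field(default_factory=set)


def run_body_rules(rules, rendered, visible_only):
    """Return the names of rules that hit.

    `rendered` is the current rendering (invisible text included);
    `visible_only` is the cleaned-up rendering.  Assumption: a rule sees
    the visible-only text unless it carries the made-up 'fulltext' tflag.
    """
    hits = []
    for rule in rules:
        text = rendered if "fulltext" in rule.tflags else visible_only
        if re.search(rule.pattern, text, re.IGNORECASE):
            hits.append(rule.name)
    return hits
```

With this shape, simple phrase rules need no change, and only the rules
that look for obfuscation would opt in to the full text.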

The non-range/T_INT/T_OBFU changes for me are as follows.  From what I could
see, they're all legit:

< 0.0077 0.0000 EMAIL_ROT13
> 0.0000 0.0000 EMAIL_ROT13
< 3.6014 0.0000 LONGWORDS
> 3.3932 0.0000 LONGWORDS
< 0.3856 0.2753 LOTS_OF_STUFF
> 0.3779 0.2753 LOTS_OF_STUFF
< 0.0231 0.0000 OBSCURED_EMAIL
> 0.0077 0.0000 OBSCURED_EMAIL
< 0.8406 0.0000 TRACKER_ID
> 0.3933 0.0000 TRACKER_ID
< 2.0822 0.0000 UNIQUE_WORDS
> 2.0436 0.0000 UNIQUE_WORDS

These all look for obfuscation and such, so I'm not surprised they went down.

< 4.4883 0.0000 DOMAIN_RATIO
> 5.0667 0.0000 DOMAIN_RATIO
< 1.7198 0.8260 UPPERCASE_25_50
> 1.7352 0.8260 UPPERCASE_25_50

I like rules that get better hit rates. :)

< 0.0463 0.0000 HTML_FONT_SIZE_HUGE
> 0.0463 0.0551 HTML_FONT_SIZE_HUGE

This was a bug in the code: the ham hit is legit, and the rule should
have hit before...

> You probably also need to provide an internal term that could be used in a
> meta rule to indicate that invisible text disappeared.  Since you can't
> compare the body and cleanbody text at the rule level, it would otherwise be
> difficult to determine if garbage got removed.

That's not very efficient anyway.  What is visible and what isn't is
already flagged internally; we would just need to export a general "there
was invisible text" flag.
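
As a rough illustration of exporting such a flag, here is a sketch using
Python's standard `html.parser`.  It is not how SpamAssassin's HTML code
works; the class name, the `had_invisible_text` attribute, and the choice
to treat only inline `display:none` / `visibility:hidden` styles as
invisible are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they never enclose text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class VisibilityParser(HTMLParser):
    """Collect visible text and note whether any invisible text was seen."""

    def __init__(self):
        super().__init__()
        self._stack = []                 # (tag, hides_content) per open element
        self.visible = []                # visible text fragments, in order
        self.had_invisible_text = False  # the exported flag (hypothetical name)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hides = "display:none" in style or "visibility:hidden" in style
        self._stack.append((tag, hides))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if any(hides for _, hides in self._stack):
            if data.strip():
                self.had_invisible_text = True
        else:
            self.visible.append(data)


def render(html):
    """Return (visible_text, had_invisible_text) for an HTML fragment."""
    parser = VisibilityParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.visible), parser.had_invisible_text
```

For example, `render('V<span style="display: none">xq</span>iagra')`
yields the visible text `Viagra` together with a true flag, so a meta rule
could test the flag without comparing the two renderings.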

However, the rule is horrible as a spam detector:

OVERALL%   SPAM%     HAM%      S/O   RANK   SCORE  NAME
  4.187   4.0333   5.2863    0.433   0.00    0.01  T_HTML_INVIS_TEXT
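
To see why that line is bad news, read the second and third columns as
spam% and ham% and the fourth as S/O (an assumption based on the usual
hit-frequencies layout).  The sketch below recomputes S/O as
spam% / (spam% + ham%); the function name is made up for the example.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O: share of a rule's hits (by percentage) that land on spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# Figures from the T_HTML_INVIS_TEXT line.
so = spam_over_overall(4.0333, 5.2863)
print(f"{so:.3f}")  # prints 0.433
```

An S/O below 0.5 means the rule hits ham relatively more often than spam,
so on its own it says nothing useful about a message being spam.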

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
