http://bugzilla.spamassassin.org/show_bug.cgi?id=3661
------- Additional Comments From [EMAIL PROTECTED] 2005-03-13 20:31 -------
Subject: Re: Request for HTML de-obfuscation of invisible SPAN's

On Sun, Mar 13, 2005 at 06:49:07PM -0800, [EMAIL PROTECTED] wrote:
> I can almost see having two body types: body, which is using the current
> rendering, and cleanbody, which uses the cleaned up rendering.
> I would then expect the development of new, possibly simpler rules, that
> worked off the unobfuscated cleanbody terms.

Well, sort of. I would actually do the reverse of what you describe.
Most of the rules already don't deal with obfuscation; they're just
simple phrases. So I would add either a new rule type (not thrilled
about that), or perhaps a tflag which specifies what type of text the
body rule is supposed to get. A lot of rules (e.g. "act now!") are never
going to be hidden, but some things can go either way.

The non-range/T_INT/T_OBFU changes for me are as follows. From what I
could see, they're all legit:

< 0.0077 0.0000 EMAIL_ROT13
> 0.0000 0.0000 EMAIL_ROT13
< 3.6014 0.0000 LONGWORDS
> 3.3932 0.0000 LONGWORDS
< 0.3856 0.2753 LOTS_OF_STUFF
> 0.3779 0.2753 LOTS_OF_STUFF
< 0.0231 0.0000 OBSCURED_EMAIL
> 0.0077 0.0000 OBSCURED_EMAIL
< 0.8406 0.0000 TRACKER_ID
> 0.3933 0.0000 TRACKER_ID
< 2.0822 0.0000 UNIQUE_WORDS
> 2.0436 0.0000 UNIQUE_WORDS

These all look for obfuscation and such, so I'm not surprised their hit
rates went down.

< 4.4883 0.0000 DOMAIN_RATIO
> 5.0667 0.0000 DOMAIN_RATIO
< 1.7198 0.8260 UPPERCASE_25_50
> 1.7352 0.8260 UPPERCASE_25_50

I like rules that get better hit rates. :)

< 0.0463 0.0000 HTML_FONT_SIZE_HUGE
> 0.0463 0.0551 HTML_FONT_SIZE_HUGE

This was a bug in the code: the ham hit is legit and should have been
hit before...

> You probably also need to provide an internal term that could be used in a
> meta rule to indicate that invisible text disappeared. Since you can't
> compare the body and cleanbody text at the rule level, it would otherwise be
> difficult to determine if garbage got removed.
Comparing the two renderings isn't very efficient anyway. What is
visible and what isn't is already flagged internally; we would just need
to export a general "there was invisible text" flag. However, the rule
is horrible as a spam detector:

  4.187 4.0333 5.2863 0.433 0.00 0.01 T_HTML_INVIS_TEXT

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
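For anyone wanting to repeat the before/after comparison quoted in this comment, here is a rough Python sketch that reads diff-style lines of the form "< spam% ham% RULE" / "> spam% ham% RULE" and reports the change in spam hit rate per rule. The column order (spam first, ham second) is an assumption inferred from the HTML_FONT_SIZE_HUGE remark, and the helper names parse_diff and spam_deltas are made up for illustration; this is not part of SpamAssassin.

```python
from typing import Dict, Iterable, Tuple

Rates = Dict[str, Tuple[float, float]]  # rule name -> (spam%, ham%), assumed column order


def parse_diff(lines: Iterable[str]) -> Tuple[Rates, Rates]:
    """Split diff-style lines into 'before' ('<') and 'after' ('>') tables."""
    before: Rates = {}
    after: Rates = {}
    for line in lines:
        parts = line.split()
        # Expect exactly: marker, spam%, ham%, rule name. Skip anything else.
        if len(parts) != 4 or parts[0] not in ("<", ">"):
            continue
        side = before if parts[0] == "<" else after
        side[parts[3]] = (float(parts[1]), float(parts[2]))
    return before, after


def spam_deltas(before: Rates, after: Rates) -> Dict[str, float]:
    """Change in spam hit rate (after minus before) for rules present on both sides."""
    return {
        rule: round(after[rule][0] - before[rule][0], 4)
        for rule in before
        if rule in after
    }


# Two pairs copied from the comment above.
sample = [
    "< 0.8406 0.0000 TRACKER_ID",
    "> 0.3933 0.0000 TRACKER_ID",
    "< 4.4883 0.0000 DOMAIN_RATIO",
    "> 5.0667 0.0000 DOMAIN_RATIO",
]

before, after = parse_diff(sample)
deltas = spam_deltas(before, after)

for rule, delta in sorted(deltas.items()):
    print(f"{rule:20s} {delta:+.4f}")
```

A negative delta means the rule hit less spam once invisible text was dropped from the rendering, which is what one would expect for the obfuscation-oriented rules.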
