On 08/23/2014 12:07 PM, Eric Shubert wrote:
On 08/23/2014 11:26 AM, Eric Shubert wrote:
It appears that these spams are hiding random text inside the HTML
in order to beat the Bayes filter. At least that's my guess.
If so, a filter/editor that strips all unviewable text out of a
message's HTML content before the message is sent to sa-learn should
make the Bayes filter effective once again.
Thoughts on this? Anyone know of a filter we can pipe messages through
on their way to sa-learn?
It looks as though search engines also consider hidden text to be spam.
http://www.seologic.com/faq/hidden-text
Ok, so all of these that I've examined have
<font color="white">
in them to hide text at the end of the email.
You can quickly check to see if there's hidden text by selecting the
text (it changes color then). Viewing the source will show the technique
that's being used to hide the text. Actually,
<font color="white">
is a pretty unsophisticated technique from what I've read about it.
Fortunately it should be pretty easy to identify as well.
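
To make the strip-before-sa-learn idea concrete, here is a minimal Python
sketch. It is not an existing filter. The names (VisibleTextExtractor,
visible_text) are made up for illustration, and the list of hiding tricks
it recognizes (white font color, an inline style of color:white or
display:none, script/style blocks) is an assumption. It ignores the
background color and external CSS, and it treats everything nested inside
a hidden element as hidden.

```python
import re
from html.parser import HTMLParser

# Assumption: these values are the "invisible on white" colors worth catching.
WHITE = {"white", "#fff", "#ffffff"}
# Matches a style declaration whose own property is color:white or display:none
# (the leading ^ or ; keeps background-color from matching).
HIDDEN_STYLE = re.compile(
    r"(?:^|;)\s*(?:color\s*:\s*(?:white|#fff(?:fff)?)|display\s*:\s*none)\s*(?:;|$)",
    re.I,
)
# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collects only the text that is not inside a 'hidden' element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # one (tag, hidden) pair per open element
        self.chunks = []   # visible text fragments, in document order

    def _is_hidden(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            return True
        if tag == "font" and (attrs.get("color") or "").strip().lower() in WHITE:
            return True
        return bool(HIDDEN_STYLE.search(attrs.get("style") or ""))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inherited = bool(self.stack) and self.stack[-1][1]
        self.stack.append((tag, inherited or self._is_hidden(tag, attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not (self.stack and self.stack[-1][1]):
            self.chunks.append(data)


def visible_text(html):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

This only handles a bare HTML string. To put it in front of sa-learn, one
would still have to walk the MIME parts of each message (for example with
Python's email package) and replace every text/html payload with its
cleaned version before piping the message on.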
Looking into the SA rules, I see this:
body HTML_FONT_LOW_CONTRAST eval:html_test('font_low_contrast')
describe HTML_FONT_LOW_CONTRAST HTML font color similar or identical to background
I would expect this rule to catch exactly this sort of thing. It's defined in the
/var/lib/spamassassin/3.003002/updates_spamassassin_org/20_html_tests.cf
file.
The Mail::SpamAssassin::Plugin::HTMLEval plugin is loaded according to
--lint.
So now I'm wondering, why isn't this rule firing for these messages? Is
the test so lame that it doesn't pick up the <font color="white"> as
being low contrast? I do see some HTML_FONT_LOW_CONTRAST occurrences in
the spamd log (maillog for me) files, so the rule is firing sometimes.
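
To put a number on "sometimes", the log can be tallied. The sketch below
is only an illustration. It assumes spamd result lines of the form
"spamd: result: Y 7 - RULE_A,RULE_B scantime=..." with the whole rule list
on one line; if your log format differs, the regex needs adjusting. The
function name count_rule_hits is hypothetical.

```python
import re
from collections import Counter

# Assumed log shape: "... spamd: result: <Y or .> <score> - <RULES> scantime=..."
RESULT_RE = re.compile(
    r"spamd: result: (?P<verdict>[Y.]) (?P<score>-?\d+) - (?P<rules>[\w,]*)"
)


def count_rule_hits(lines, rule):
    """Count scanned messages by (verdict, rule_fired).

    verdict is "Y" for spam and "." for ham, as logged by spamd.
    Lines that are not result lines are skipped.
    """
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if not match:
            continue
        fired = rule in match.group("rules").split(",")
        counts[(match.group("verdict"), fired)] += 1
    return counts
```

Usage would be along the lines of
count_rule_hits(open("/var/log/maillog"), "HTML_FONT_LOW_CONTRAST"),
with the path adjusted to wherever spamd logs on your system.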
The scoring is:
score HTML_FONT_LOW_CONTRAST 0.713 0.001 0.786 0.001
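
As a reading aid for that line: per the Mail::SpamAssassin::Conf
documentation, four scores on a score line correspond to the four score
sets, in the order (no Bayes, no network tests), (no Bayes, network
tests), (Bayes, no network tests), (Bayes and network tests). The small
Python sketch below just labels the values; parse_score_line is a
hypothetical helper, not part of SpamAssassin.

```python
# Score set order as documented for the SpamAssassin "score" directive.
SCORE_SETS = (
    "no Bayes, no network tests",
    "no Bayes, network tests",
    "Bayes, no network tests",
    "Bayes and network tests",
)


def parse_score_line(line):
    """Split a 'score RULE v0 [v1 v2 v3]' line into (rule, {score set: value})."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError("not a score line: %r" % line)
    name = parts[1]
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4  # a single score applies to every score set
    if len(values) != 4:
        raise ValueError("expected 1 or 4 scores, got %d" % len(values))
    return name, dict(zip(SCORE_SETS, values))


name, scores = parse_score_line(
    "score HTML_FONT_LOW_CONTRAST 0.713 0.001 0.786 0.001"
)
for label, value in scores.items():
    print("%-28s %.3f" % (label, value))
```

Read this way, a setup running with both Bayes and network tests enabled
would add only 0.001 for this rule even when it does fire.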
That might be lower than it should be, but on the messages I'm seeing,
this rule isn't firing at all. Why not?
I just received two more of these spams. This time, they both use
<div style="color:white">
to hide the (random) text. That's a *little* more sophisticated. Still,
the rule didn't fire.
So I think I'm on the right track with this rule. Just need to figure
out why it's not firing, and probably will need to adjust the scoring
upwards as well.
Stay tuned. (Or dig in yourself if you'd like a little challenge!)
--
-Eric 'shubes'