R. Scott Perry wrote:

The problem is that it is nearly impossible to determine which are valid HTML tags and which are not -- that would require a database of known good HTML tags, which would need to be constantly updated.


This was actually the first filter that I tried writing :) I got a list of valid HTML tags and subtracted them from a list of two-letter codes that I had, i.e. "<aa", "<ab", "<ac", etc. The problem is that you can define your own tags with XML and call them anything you want (and that might not be all of it). It was of course a fairly hefty filter as well. That led me to the idea of just going after two-letter character combinations which were not in the dictionary. Maybe I can revisit that filter now by limiting the characters used to just the 15 most common letters (just 225 combinations, which probably cover 80% of dictionary words), and counterbalancing with some stuff that detects XML (which I hadn't thought of back then).

This would work on both gibberish and dictionary randomization.
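
To make the idea concrete, here is a rough sketch of how the list of two-letter "<xy" strings could be generated. It is only an illustration: the 15-letter set (the first 15 of the familiar "etaoin shrdlu" ordering), the short tag list, and the function names are all my assumptions, and a real list of valid HTML tags would be much longer.

```python
from itertools import product

# Assumption: first 15 letters of the traditional "etaoin shrdlu cmfwyp" ordering.
COMMON_LETTERS = "etaoinshrdlucmf"

# Assumption: a small, incomplete sample of HTML tag names for illustration only.
KNOWN_TAGS = {
    "a", "abbr", "b", "body", "br", "center", "div", "em", "font", "form",
    "head", "hr", "html", "i", "img", "input", "li", "link", "meta", "ol",
    "p", "script", "span", "strong", "style", "table", "td", "th", "title",
    "tr", "u", "ul",
}


def candidate_prefixes(letters=COMMON_LETTERS, tags=KNOWN_TAGS):
    """Return "<xy" strings for letter pairs that do not begin any known tag."""
    tag_prefixes = {tag[:2] for tag in tags if len(tag) >= 2}
    pairs = ("".join(p) for p in product(letters, repeat=2))
    return sorted("<" + pair for pair in pairs if pair not in tag_prefixes)


def find_hits(body, candidates):
    """Return the candidate strings that occur in the body (case-insensitive)."""
    lowered = body.lower()
    return [c for c in candidates if c in lowered]


candidates = candidate_prefixes()
hits = find_hits("hello <aa>spam</aa> <table>", candidates)
```

With 15 letters there are 15 x 15 = 225 pairs before the known tag prefixes are subtracted. Custom XML tags would still trigger hits, which is why some separate XML detection would be needed as a counterbalance.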

The problem that has been appearing with more frequency of late, though, is randomization with punctuation, mostly periods, but other characters as well. Periods of course are problematic because they have too many legit uses in domain names and other things which can appear in E-mail. This stuff is all very processor intensive, so I've been avoiding it until I have a better handle on my other filters.

Generally I can delete a piece of spam or pass an E-mail with a peak of about 10%-15% of my processor; however, a non-spam 32K text message without attachments can drive both processors at an average of 80% for up to 5 seconds. I expect that the END functionality will help a great deal in those situations, but I'm also looking elsewhere to save. Just by reordering my filters, I think I cut the average processing power required by about half, after previously cutting things down with SKIPIFWEIGHT.
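
As a generic illustration of why ordering matters (this is not Declude's actual SKIPIFWEIGHT or END behavior, just the general early-exit idea), the sketch below runs the cheapest filters first and stops once the accumulated weight reaches a threshold. The tuple layout, the cost numbers, and the names are all hypothetical.

```python
def score_message(body, filters, skip_weight):
    """Run (name, cost, weight, predicate) filters cheapest first.

    Stops as soon as the accumulated weight reaches skip_weight, so the
    expensive filters never run on mail that is already over the limit.
    Returns the total weight and the names of the filters that ran.
    """
    total = 0
    ran = []
    for name, _cost, weight, predicate in sorted(filters, key=lambda f: f[1]):
        if total >= skip_weight:
            break
        ran.append(name)
        if predicate(body):
            total += weight
    return total, ran


# Hypothetical filters: a cheap keyword test and a costly catch-all test.
filters = [
    ("expensive", 10, 5, lambda b: True),
    ("cheap", 1, 10, lambda b: "viagra" in b),
]

spam_result = score_message("buy viagra", filters, 10)
ham_result = score_message("hello", filters, 10)
```

Obvious spam is settled by the cheap filter alone, while legitimate mail still goes through every filter, which matches the observation that long non-spam messages are the expensive case.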

Matt

---
This E-mail came from the Declude.JunkMail mailing list.  To
unsubscribe, just send an E-mail to [EMAIL PROTECTED], and
type "unsubscribe Declude.JunkMail".  The archives can be found
at http://www.mail-archive.com.
