http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5041





------- Additional Comments From [EMAIL PROTECTED]  2006-08-10 21:52 -------
> non-text noise parts

Exactly, and it produces an array of "short" lines that are capped at 2048
characters, so that rules aren't overwhelmed and don't each have to deal with
line length on their own. Perhaps we should say that message text is logically
a set of words and, instead of an array of short lines, produce an array of
short words.
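
For illustration, here's a rough sketch of the two approaches (the current
line-capped array vs. a word-based array), assuming we're handed the
already-rendered text as one string. The function names are made up, not the
actual Message code, and while the 2048 line cap is what we use today, the
64-character word cap is just a stand-in:

  use strict;
  use warnings;

  # Current approach: split the rendered text into lines, capping each at
  # 2048 characters so no single rule ever sees an arbitrarily long line.
  sub body_as_short_lines {
      my ($text, $max) = @_;
      $max ||= 2048;
      my @out;
      for my $line (split /\n/, $text) {
          next unless length $line;                  # skip blank lines
          push @out, $1 while $line =~ /\G(.{1,$max})/gs;
      }
      return \@out;
  }

  # Proposed alternative: treat the text as a set of words instead; the
  # 64-character cap is illustrative only.
  sub body_as_short_words {
      my ($text, $max) = @_;
      $max ||= 64;
      my @out;
      for my $word (split /\s+/, $text) {
          next unless length $word;                  # skip empty fields
          push @out, $1 while $word =~ /\G(.{1,$max})/gs;
      }
      return \@out;
  }
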

Ok, I've thought about this a bit more, and I'm leaving the previous paragraph
I typed to provide context for how I'm thinking about this. The first attached
example has a block of 76-character lines with no spaces; if we don't want to
break up long URLs, there may not be a "short" word length that would do us any
good. Even worse, in the second example, with the uuencoded block, the lines
are only 64 or so characters long and contain some embedded spaces, yet it
still takes too long to process. What we haven't done is profile the slow rules
to see just what the bottleneck(s) are. If there is a bottleneck common to all
of those rules, once we know what it is we can either do something in message
body processing or come up with some standard thing to do in such rules to
avoid it.
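
To get at that, something as crude as a per-rule timing wrapper would probably
do. This is only a sketch and assumes we can get our hands on the compiled body
regexps and the body array directly, rather than going through the real check()
plumbing:

  use strict;
  use warnings;
  use Time::HiRes qw(gettimeofday tv_interval);

  # Run each body rule's regexp over the whole body array, time it, and
  # print the rules sorted slowest-first.  %$body_rules maps rule name to
  # a compiled (qr//) pattern; @$body_lines is the usual body array.
  sub profile_body_rules {
      my ($body_rules, $body_lines) = @_;
      my %elapsed;
      for my $name (keys %$body_rules) {
          my $re = $body_rules->{$name};
          my $t0 = [gettimeofday];
          for my $line (@$body_lines) {
              () = $line =~ $re;
          }
          $elapsed{$name} = tv_interval($t0);
      }
      for my $name (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
          printf "%-30s %.4fs\n", $name, $elapsed{$name};
      }
  }

Running that against the two attached examples should tell us whether the slow
rules share one pathological pattern or whether each is slow in its own way.
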

If we can't do that, we may be stuck with figuring out a heuristic for
detecting BASE64 and uuencoded blocks and not passing them through into the
message body array -- but then we have to be very careful that spammers can't
trick the parser into letting through text that will still render ok in the
mail client.
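
Something along these lines is what I have in mind; the line-length thresholds
and the 90% cutoff are guesses and would need tuning with exactly that
trick-the-parser problem in mind:

  use strict;
  use warnings;

  # Guess whether a run of lines looks like a BASE64 or uuencoded block:
  # uniform-ish length, restricted alphabet, no ordinary words.
  sub looks_like_encoded_block {
      my (@lines) = @_;
      return 0 unless @lines >= 4;      # need a few lines to call it a block
      my $encodedish = 0;
      for my $line (@lines) {
          # base64-ish: only the base64 alphabet, long, no spaces
          if ($line =~ /^[A-Za-z0-9+\/=]{60,}$/)  { $encodedish++; next }
          # uuencode-ish: full-length data lines start with 'M' and use
          # only characters in the ' '..'`' range
          if ($line =~ /^M[ -`]{55,62}$/)         { $encodedish++; next }
      }
      # call it encoded only if nearly every line in the run matches
      return $encodedish >= 0.9 * @lines;
  }
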



