On 11/04/2009 12:34 PM, Sidney Markowitz wrote:
Warren Togami wrote, On 4/11/09 7:17 PM:
My point was lost here. I pasted these URLs as an example of what the
SpamAssassin URI parser might see without decoding.
Let's see if I understand this correctly: The message consists of a
sequence of bytes which encode characters in a certain charset. When
host names and domain names were restricted to 7-bit ASCII then they
could be parsed out by SpamAssassin by looking at the raw bytes without
regard to the charset. Now we would have to convert the entire byte
stream from the raw bytes to wide characters according to the charset of
the message before we could be sure to parse and handle text URLs
correctly. Does that sum it up?
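
To make the raw-bytes-versus-decoded distinction concrete, here is a
minimal Python sketch. It is not SpamAssassin's parser: the two regular
expressions, the helper names, and the sample host "bücher.example" are
all assumptions made up for illustration.

```python
import re

# Hypothetical, simplified host matchers (not SpamAssassin's real patterns).
ASCII_HOST = re.compile(rb"https?://([A-Za-z0-9.-]+)")   # works on raw bytes
TEXT_HOST = re.compile(r"https?://([\w.-]+)")            # works on decoded text


def hosts_from_raw(body):
    """Scan the undecoded byte stream, assuming hosts are 7-bit ASCII."""
    return ASCII_HOST.findall(body)


def hosts_from_decoded(body, charset):
    """Decode the body according to its charset first, then scan characters."""
    return TEXT_HOST.findall(body.decode(charset, errors="replace"))


ascii_msg = "Visit http://example.com/ now".encode("us-ascii")
idn_msg = "Visit http://bücher.example/ now".encode("utf-8")

print(hosts_from_raw(ascii_msg))             # ASCII host: raw scan is enough
print(hosts_from_raw(idn_msg))               # raw scan stops at the first non-ASCII byte
print(hosts_from_decoded(idn_msg, "utf-8"))  # decoded scan sees the whole host
```

With an ASCII-only host the raw scan finds the full name. With the
non-ASCII host it truncates the name at the first byte outside the ASCII
class, and only the decoded scan recovers it.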
I haven't paid attention to the issue of charset encoding and wide
characters. How much are we getting away with assuming that most emails
are in one-byte character codes or at least in codes that represent the
ASCII set as one byte and so we can just apply rules to the raw byte
strings and it works most of the time? How badly does SpamAssassin fall
down if mail is encoded in a charset that violates that assumption?
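
As a rough illustration of that assumption, the sketch below applies one
byte-level pattern to the same text encoded in several charsets. The
pattern and the helper name raw_rule_hits are hypothetical, and the
sample skips any transfer encoding (base64, quoted-printable) a real
message might carry.

```python
import re

# Hypothetical byte-level rule that expects ASCII to be one byte per character.
ASCII_URL = re.compile(rb"http://[A-Za-z0-9.-]+")


def raw_rule_hits(text, charset):
    """Encode text in the given charset and test the rule on the raw bytes."""
    return ASCII_URL.search(text.encode(charset)) is not None


sample = "see http://example.com/ today"
for charset in ("us-ascii", "iso-8859-1", "utf-8", "utf-16-le"):
    print(charset, raw_rule_hits(sample, charset))
```

The rule hits for the three charsets that keep ASCII as single bytes and
misses for UTF-16, where every ASCII character is followed by a zero
byte.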
Whether a link is clickable today is not relevant. MUAs and browsers
will adapt in the future to support these international TLDs.
It is relevant to what we should handle right now versus what we plan to
handle in the future when MUAs are changed.
-- sidney
What SpamAssassin handles right now is fine. Punycode domain names
(without the soon-to-be-ratified IDN TLDs) are rare because clients do
not support them.
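
For readers unfamiliar with Punycode, this short Python sketch shows the
ASCII form a non-ASCII host name takes on the wire. It uses Python's
built-in "idna" codec (IDNA 2003 rules); the host "bücher.example" is
just an example name.

```python
# Convert a Unicode host name to its ASCII-compatible (Punycode) form and back.
unicode_host = "bücher.example"

ace_host = unicode_host.encode("idna")   # bytes, each non-ASCII label gets an "xn--" prefix
round_trip = ace_host.decode("idna")     # back to the Unicode form

print(ace_host)
print(round_trip)
```

Because the encoded form is plain ASCII, a byte-level URI parser can
match it without knowing the charset of the message.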
I thought this thread was about the future, after clients begin
supporting IDN TLDs.
Warren