Warren Togami wrote, On 4/11/09 7:17 PM:
> My point was lost here. I pasted these URL's as an example of what the spamassassin URI parser might see without decoding.
Let's see if I understand this correctly: The message consists of a sequence of bytes that encode characters in a certain charset. When host names and domain names were restricted to 7-bit ASCII, SpamAssassin could parse them out by looking at the raw bytes without regard to the charset. Now we would have to convert the entire byte stream to wide characters according to the charset of the message before we could be sure of parsing and handling text URLs correctly. Does that sum it up?
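To make the distinction concrete, here is a minimal sketch in Python (chosen for brevity; SpamAssassin itself is written in Perl, and this is not its URI parser). The sample URL, the two regular expressions, and the function names are all hypothetical, picked only to show how a byte-oriented match can cut off a non-ASCII host name that a decode-first match would catch.

```python
import re

# Hypothetical sample text containing a non-ASCII host name.
SAMPLE = "Visit http://bücher.example/path for details"

# Byte-oriented pattern: assumes host names use only 7-bit ASCII.
ASCII_URL = re.compile(rb"https?://[A-Za-z0-9.-]+")

# Character-oriented pattern: in a str pattern, \w also matches
# non-ASCII letters such as "ü".
UNICODE_URL = re.compile(r"https?://[\w.-]+")


def urls_from_raw_bytes(raw):
    """Scan the undecoded byte stream, ignoring the charset."""
    return ASCII_URL.findall(raw)


def urls_from_decoded(raw, charset):
    """Decode according to the declared charset first, then scan."""
    return UNICODE_URL.findall(raw.decode(charset, errors="replace"))


if __name__ == "__main__":
    raw = SAMPLE.encode("utf-8")
    print(urls_from_raw_bytes(raw))         # [b'http://b']
    print(urls_from_decoded(raw, "utf-8"))  # ['http://bücher.example']
```

The raw-byte scan stops at the first byte of the encoded "ü", so a rule keyed on the host name would see only a fragment. The decode-first scan depends on knowing the message's charset, which is the extra step in question.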
I haven't paid attention to the issue of charset encoding and wide characters. How much are we getting away with by assuming that most emails are in single-byte character codes, or at least in codes that represent the ASCII set as one byte, so that we can just apply rules to the raw byte strings and have it work most of the time? How badly does SpamAssassin fall down if mail is encoded in a charset that violates that assumption?
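As a rough illustration of that assumption, the sketch below (again Python, with a hypothetical message and helper name) checks whether the ASCII byte sequence for "http://" appears intact in the encoded form of a message. It says nothing about how common each charset is in real mail; it only shows which kind of encoding breaks a raw-byte rule.

```python
# Hypothetical message text; only the charset varies below.
MESSAGE = "See http://example.com/ now"

# The byte sequence a raw-byte rule would look for.
NEEDLE = b"http://"


def ascii_pattern_survives(message, charset):
    """Return True if the ASCII bytes of NEEDLE appear unchanged
    in the message once it is encoded in the given charset."""
    return NEEDLE in message.encode(charset)


if __name__ == "__main__":
    for charset in ("us-ascii", "iso-8859-1", "utf-8", "utf-16-le"):
        print(charset, ascii_pattern_survives(MESSAGE, charset))
```

The first three charsets encode each ASCII character as the same single byte, so the raw-byte match still works. UTF-16 places a zero byte after every ASCII character, so the same rule finds nothing unless the body is decoded first.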
> Clickable link today is not relevant. MUA and browsers in the future will adapt to support these international TLD's.
It is relevant to what we should handle right now versus what we plan to handle in the future when MUAs are changed.
-- sidney
