Re: Non-Roman characters in TLDs and domain names

Sidney Markowitz Wed, 04 Nov 2009 09:35:48 -0800

Warren Togami wrote, On 4/11/09 7:17 PM:

> My point was lost here. I pasted these URLs as an example of what the SpamAssassin URI parser might see without decoding.

Let's see if I understand this correctly: The message consists of a sequence of bytes which encode characters in a certain charset. When host names and domain names were restricted to 7-bit ASCII, they could be parsed out by SpamAssassin by looking at the raw bytes without regard to the charset. Now we would have to convert the entire byte stream from raw bytes to wide characters according to the charset of the message before we could be sure to parse and handle text URLs correctly. Does that sum it up?
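To make the distinction concrete, here is a minimal sketch in Python (not SpamAssassin's actual parser). The pattern `URL_RE`, the helper `find_urls`, and the sample host name are all hypothetical, chosen only to contrast matching on decoded text with matching on raw bytes:

```python
import re

# Hypothetical, deliberately simple URL pattern; a real URI parser is far stricter.
URL_RE = re.compile(r"https?://[^\s<>\"]+")
RAW_URL_RE = re.compile(rb"https?://[^\s<>\"]+")


def find_urls(raw, charset):
    """Decode the raw bytes using the declared charset, then scan the text."""
    text = raw.decode(charset, errors="replace")
    return URL_RE.findall(text)


# A body containing a non-ASCII host name, encoded in a charset where
# ASCII characters are NOT single bytes (UTF-16, little endian).
body = "See http://例え.テスト/ for details".encode("utf-16-le")

decoded_hits = find_urls(body, "utf-16-le")  # scan after decoding
raw_hits = RAW_URL_RE.findall(body)          # scan the raw bytes directly

print(decoded_hits)
print(raw_hits)
```

In this sketch the scan over decoded text finds the URL, while the raw-byte scan finds nothing, because every ASCII character in the UTF-16 byte stream is followed by a zero byte.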

I haven't paid attention to the issue of charset encoding and wide characters. How much are we getting away with by assuming that most emails are in one-byte character codes, or at least in codes that represent the ASCII set as one byte, so that we can just apply rules to the raw byte strings and it works most of the time? How badly does SpamAssassin fall down if mail is encoded in a charset that violates that assumption?
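One rough way to probe that assumption is to check, per charset, whether ASCII text keeps its single-byte encoding, and whether non-ASCII characters can produce bytes that look like ASCII. The sketch below uses only Python's standard codecs; the helper name `ascii_survives` and the sample strings are hypothetical:

```python
def ascii_survives(charset):
    """True if plain ASCII text encodes to the same bytes as in ASCII."""
    sample = "http://example.com/"
    return sample.encode(charset) == sample.encode("ascii")


# Charsets where raw-byte rules on ASCII text would, or would not, still match.
results = {cs: ascii_survives(cs) for cs in ("utf-8", "iso-8859-1", "utf-16-le")}
print(results)

# The opposite failure: in Shift_JIS the second byte of a two-byte character
# can fall in the ASCII range, so a raw-byte rule can match inside a character.
kanji_bytes = "表".encode("shift_jis")
false_backslash = b"\\" in kanji_bytes
print(kanji_bytes, false_backslash)
```

The first check covers rules that would miss ASCII text entirely (as with UTF-16); the second covers rules that could fire on bytes that are not really ASCII characters at all.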

> Clickable link today is not relevant. MUAs and browsers in the future will adapt to support these international TLDs.

It is relevant to what we should handle right now versus what we plan to handle in the future when MUAs are changed.


 -- sidney
