[This is a repost excerpted from two messages I sent to the list. I just
discovered that my email settings were left incorrect after I recovered
from a hard disk crash. I apologize for the redundancy if the other two
messages are just stuck instead of lost and you end up seeing them.]

I'm bringing this up on the dev list to get some discussion of the
technical issues involved before opening a Bugzilla issue for it.

News of an ICANN decision to allow international character
sets in domain names was reported last week, for example, in this article:

 http://www.voanews.com/english/2009-10-30-voa14.cfm

The article doesn't have much technical detail, but it does say that
there will be new TLDs "by the end of the year", which is less than two
months away.

I'm concerned that this change might have a big impact on SpamAssassin's
parsing of headers and URLs.

Further digging found this:

http://idn.icann.org/E-mail_test

which seems to imply that email addresses will use the A-label encoding
of IDN, which converts non-ASCII characters into ASCII strings built
from the letters a through z, the digits 0 through 9, and the hyphen
character, with a prefix of "xn--". As far as I can tell from the
examples, there will be new TLDs that will have to be A-label encoded.
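To make the A-label conversion concrete, here is a minimal Python sketch.
It assumes Python's built-in "idna" codec, which implements the IDNA 2003
rules and may differ in edge cases from whatever ICANN finally specifies.
The function names and the "bücher.example" domain are illustrative only.

```python
# Sketch: convert a domain name between its Unicode form and its
# A-label (ASCII, "xn--" prefixed) form using Python's "idna" codec.
# Assumption: IDNA 2003 behaviour, as implemented by the standard library.

def to_a_label(domain: str) -> str:
    """Return the ASCII-compatible (A-label) form of a domain name."""
    return domain.encode("idna").decode("ascii")

def to_u_label(domain: str) -> str:
    """Return the Unicode form of an A-label encoded domain name."""
    return domain.encode("ascii").decode("idna")

if __name__ == "__main__":
    print(to_a_label("bücher.example"))         # xn--bcher-kva.example
    print(to_u_label("xn--bcher-kva.example"))  # bücher.example
```

Plain ASCII labels pass through unchanged, so only labels containing
non-ASCII characters gain the "xn--" prefix.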

I think this means that there will not need to be a major change to
SpamAssassin regarding parsing of headers in which A-label encoding is
required. Where we now have routines that check for valid TLDs by
looking for .com, .org, .us, .kr, etc., we will simply have to add some
new TLDs to the list. They will still be specific fixed ASCII strings;
there will just be new TLDs that look like ".xn--deba0ad".
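As a rough illustration of why the header case looks easy, the sketch
below treats an A-label TLD as just another fixed ASCII string in the
list. It is Python rather than SpamAssassin's own code, and VALID_TLDS
and has_valid_tld are hypothetical names, not existing routines.

```python
# Sketch: a TLD check in which A-label TLDs are ordinary list entries.
# Assumption: VALID_TLDS is a tiny stand-in for a real, much longer list.
VALID_TLDS = {"com", "org", "us", "kr", "xn--deba0ad"}

def has_valid_tld(hostname: str) -> bool:
    """Return True if the hostname ends in a TLD from VALID_TLDS."""
    # Drop a trailing root dot and compare case-insensitively.
    labels = hostname.rstrip(".").lower().split(".")
    # Require at least one label in front of the TLD.
    return len(labels) >= 2 and labels[-1] in VALID_TLDS

if __name__ == "__main__":
    print(has_valid_tld("example.com"))          # True
    print(has_valid_tld("example.xn--deba0ad"))  # True
    print(has_valid_tld("example.invalid"))      # False
```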

However, what does this mean for detecting URLs in plain text messages,
in which a URL string can be in a non-ASCII charset and MUAs might
(eventually) parse it as a URL?
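One possible approach, sketched here in Python purely for discussion, is
to let the URL pattern accept non-ASCII host characters and then
normalize each host to its A-label form before any TLD or blocklist
check. The regular expression, the function name, and the restriction to
http and https are all assumptions of mine, and the sketch presumes the
message body has already been decoded to Unicode text.

```python
import re

# Assumption: a URL candidate is "http://" or "https://" followed by a
# host made of any characters other than whitespace and / : ? #
URL_RE = re.compile(r"https?://([^\s/:?#]+)", re.IGNORECASE)

def extract_ascii_hosts(text: str) -> list:
    """Find URL hosts in decoded text and return them in A-label form."""
    hosts = []
    for match in URL_RE.finditer(text):
        host = match.group(1)
        try:
            # The "idna" codec (IDNA 2003) maps non-ASCII labels to "xn--..."
            hosts.append(host.encode("idna").decode("ascii").lower())
        except UnicodeError:
            # Not convertible to a valid hostname; skip this candidate.
            continue
    return hosts

if __name__ == "__main__":
    body = "see http://bücher.example/path and http://example.com"
    print(extract_ascii_hosts(body))
```

After this step the hosts are plain ASCII again, so the same fixed-string
TLD checks used for headers could in principle be reused.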

  -- sidney
