On 11/03/2009 08:42 PM, Sidney Markowitz wrote:

However, what does this mean for detecting URLs in plain text messages
in which a URL string can be in a non-ASCII charset and MUAs might
(eventually) parse them as URLs?


It seems clear that we will need to flatten/encode any URI domain to punycode for URIBL lookups.

http://search.cpan.org/search?query=punycode&mode=all
There are several Punycode-handling libraries on CPAN. We might be best off standardizing on a particular library so URIBLs can use the same methodology for encoding their punycode listings.
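
As a rough sketch of what the flattening step might look like, assuming Net::IDN::Encode (one of the Punycode handlers on CPAN, which provides a domain_to_ascii() function) and a made-up domain and URIBL zone:

#!/usr/bin/perl
# Minimal sketch: flatten a Unicode hostname to its ACE/punycode form
# before building the URIBL query name. Domain and zone are invented.
use strict;
use warnings;
use utf8;
use Net::IDN::Encode qw(domain_to_ascii);

my $domain = 'bücher.example';               # Unicode domain found in a message
my $ace    = domain_to_ascii($domain);       # ACE/punycode form for the lookup
print "query: $ace.multi.uribl.example\n";   # xn--bcher-kva.example.multi.uribl.example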

The unclear part is whether we will need to decode URIs prior to punycode encoding. I suspect we will be forced to decode. Why?

* Encoding punycode from raw binary garbage input might be poorly defined and unstandardized?
* Some SpamAssassin-using sites decode everything by preference while most others do not decode. This means you could be querying URIBLs with two different flattened punycode strings? (See the sketch after this list.)
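
A rough illustration of that second point, assuming URI::Escape, Encode and Net::IDN::Encode from CPAN, with a made-up percent-encoded hostname: a site that percent-decodes before punycode encoding ends up querying a different string than a site that looks up the raw bytes.

#!/usr/bin/perl
# Sketch of the divergence between decode-first and raw lookups.
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);
use Net::IDN::Encode qw(domain_to_ascii);

my $raw_host = 'b%C3%BCcher.example';    # hostname as extracted from the message body

# Site A: no decoding -- queries whatever bytes it found
my $query_a = lc $raw_host;

# Site B: percent-decode, interpret as UTF-8, then punycode-encode
my $unicode = decode('UTF-8', uri_unescape($raw_host));
my $query_b = domain_to_ascii($unicode);

print "site A queries: $query_a\n";      # b%c3%bccher.example
print "site B queries: $query_b\n";      # xn--bcher-kva.example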

Please correct me if my understanding is incorrect.

Warren Togami
[email protected]
