On 11/03/2009 08:42 PM, Sidney Markowitz wrote:

However, what does this mean for detecting URLs in plain text messages
in which a URL string can be in a non-ASCII charset and MUAs might
(eventually) parse them as URLs?


It seems clear that we will need to flatten/encode any URI domain to punycode for URIBL lookups.

http://search.cpan.org/search?query=punycode&mode=all
There are several Punycode-handling libraries on CPAN. We might be best off standardizing on a particular library so URIBLs can use the same methodology for encoding their punycode listings.
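
As a rough sketch of what the flattening step might look like, assuming Net::IDN::Encode (one of the Punycode handlers on CPAN, which provides a domain_to_ascii() function) and a made-up domain and URIBL zone:

#!/usr/bin/perl
# Minimal sketch: flatten a Unicode hostname to its ACE/punycode form
# before building the URIBL query name. Domain and zone are invented.
use strict;
use warnings;
use utf8;
use Net::IDN::Encode qw(domain_to_ascii);

my $domain = 'bücher.example';               # Unicode domain found in a message
my $ace    = domain_to_ascii($domain);       # ACE/punycode form for the lookup
print "query: $ace.multi.uribl.example\n";   # xn--bcher-kva.example.multi.uribl.example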

The unclear part is whether we will need to decode URIs prior to punycode encoding. I suspect we will be forced to decode. Why?

* Encoding punycode from raw binary garbage input might be poorly defined and unstandardized?
* Some SpamAssassin-using sites decode everything by preference while most others do not decode. This means you could be querying URIBLs with two different flattened punycode strings? (See the sketch after this list.)
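
A rough illustration of that second point, assuming URI::Escape, Encode and Net::IDN::Encode from CPAN, with a made-up percent-encoded hostname: a site that percent-decodes before punycode encoding ends up querying a different string than a site that looks up the raw bytes.

#!/usr/bin/perl
# Sketch of the divergence between decode-first and raw lookups.
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);
use Net::IDN::Encode qw(domain_to_ascii);

my $raw_host = 'b%C3%BCcher.example';    # hostname as extracted from the message body

# Site A: no decoding -- queries whatever bytes it found
my $query_a = lc $raw_host;

# Site B: percent-decode, interpret as UTF-8, then punycode-encode
my $unicode = decode('UTF-8', uri_unescape($raw_host));
my $query_b = domain_to_ascii($unicode);

print "site A queries: $query_a\n";      # b%c3%bccher.example
print "site B queries: $query_b\n";      # xn--bcher-kva.example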

Please correct me if my understanding is incorrect.

Warren Togami
[email protected]
