On 4/7/2012 2:59 AM, Tim Bannister wrote:
> On 7 Apr 2012, at 07:33, William A. Rowe Jr. wrote:
>
>> So we have live registrars, no longer "experimental", who are now
>> registering domains in punycode. Make of it what you will.
>>
>> Do we want to recognize non-ASCII strings in the ServerName|Alias directives
>> as utf-8 -> punycode encodings? Internally, from the time the servername
>> field is assigned, it can be an ascii mapping.
>
> I think this is more important for mass virtual hosting (VirtualDocumentRoot
> from mod_vhost_alias, etc). Users would create a document root directory
> named, eg, テスト.example and expect it to work. They don't know anything about
> Unicode, let alone punycode.
> I reckon a lot of users would work out quickly that only Roman characters
> work in domain names, but they aren't going to be able to work out how to
> rename that folder into the correct punycode nor to tell the folders apart if
> renamed in this way.
>
> As a user: I already have a configuration file with a UTF-8 ServerAlias
> defined, that's just waiting for httpd to implement this feature … and until
> then, I have the punycoded version in there as well.

I've spent a bit more time on this. The obvious issue of ambiguous domain
registrations is being handled on a registrar-by-registrar basis, and you
can get a nice summary of the punycode entries accepted by various
registrars here:

  http://www.mozilla.org/projects/security/tld-idn-policy-list.html

In thinking about what punycode is dangerous to represent, I can't come up
with any within the context of httpd.

1. User VirtualHost ServerName/ServerAlias entries, or mod_vhost_alias
   entries. These are controlled by the administrator, not affected by the
   remote client. Provided that client-provided non-ASCII domains are
   refused, punycode can be represented as UTF-8 in our access and error
   logs, server config directives and so forth when referring to the
   locally configured domain names. We should always present these in
   things like mod_info and httpd -D DUMP_VHOSTS as name(punyname) to help
   the administrator untangle any confusion.

2. Location: headers and automated self-url references must present the
   punycode url in href= and other header fields, but may present the
   utf-8 in the presentation context such as error pages or autoindexes,
   etc. Whatever the W3C has to say about this in HTML5 is irrelevant if
   we don't know whether the user agent supports utf-8 -> punycode
   transliteration.

What is less clear is what precautions we should take when functioning as a
forward proxy with proxy uri string contents, or presenting user-provided,
non-canonicalized host names. I can imagine such translation being abused
to conceal some forms of XSS exploitation.

I'd start by assembling a patch to introduce punycode transliteration into
the apr-util library and another patch into httpd for vhost, mass-vhosting
using utf-8 path names, and presenting trusted utf-8 values for our error
log and field tokens.

Does anyone have concerns before I begin messing with this logic?
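
To make the mapping concrete, here is a rough Python sketch of the three
behaviours discussed above: utf-8 -> punycode conversion, the name(punyname)
display form, and refusing non-ASCII host names from the client. It is only
an illustration, not the proposed apr-util patch (which would be C). It
leans on Python's built-in "idna" codec, which implements the older IDNA
2003 rules, and the helper names to_punycode, display_name and
accept_client_host are made up for this example.

```python
# Illustrative sketch only. The helper names below are hypothetical and are
# not part of httpd or apr-util.


def to_punycode(hostname: str) -> str:
    """Map a Unicode host name to its ASCII (punycode) form, label by label."""
    # Python's built-in "idna" codec applies IDNA 2003 (RFC 3490) ToASCII
    # to each dot-separated label.
    return hostname.encode("idna").decode("ascii")


def display_name(hostname: str) -> str:
    """Format a configured name as name(punyname) for admin-facing output."""
    puny = to_punycode(hostname)
    # Plain ASCII names map to themselves, so show them without the suffix.
    return hostname if puny == hostname else f"{hostname}({puny})"


def accept_client_host(host: str) -> bool:
    """Refuse client-supplied host names that contain non-ASCII characters."""
    return host.isascii()


if __name__ == "__main__":
    print(to_punycode("テスト.example"))
    print(display_name("テスト.example"))
    print(display_name("www.example"))
    print(accept_client_host("テスト.example"))
```

The point of the last helper is that only administrator-configured names
are ever converted; anything arriving from the client must already be in
its ASCII form before it is compared against the configured vhosts.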
