New submission from David Watson <bai...@users.sourceforge.net>:

The functions in the socket module which return host/domain names, such as gethostbyaddr() and getnameinfo(), are wrappers around byte-oriented interfaces but return Unicode strings in 3.x, and have not been updated to deal with undecodable byte sequences in the results, as discussed in PEP 383.

Some DNS resolvers do discard hostnames not matching the ASCII-only RFC 1123 syntax, but checks for this may be absent or turned off, and non-ASCII bytes can be returned via other lookup facilities such as /etc/hosts.

Currently, names are converted to str objects using PyUnicode_FromString(), i.e. by attempting to decode them as UTF-8. This can fail with UnicodeError of course, but even if it succeeds, any non-ASCII names returned will fail to round-trip correctly because most socket functions encode string arguments into IDNA ASCII-compatible form before using them. For example, with UTF-8 encoded entries

127.0.0.2 €
127.0.0.3 xn--lzg

in /etc/hosts, I get:

Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]

Here, getaddrinfo() has encoded "€" to its corresponding ACE label "xn--lzg", which maps to a different address.

PEP 383 can't be applied as-is here, since if the name happened to be decodable in the file system encoding (and thus was returned as valid non-ASCII Unicode), the result would fail to round-trip correctly as shown above, but I think there is a solution which follows the general idea of PEP 383.

Surrogate characters are not allowed in IDNs, since they are prohibited by Nameprep[1][2], so if names were instead decoded as ASCII with the surrogateescape error handler, strings representing non-ASCII names would always contain surrogate characters representing the non-ASCII bytes, and would therefore fail to encode with the IDNA codec. Thus there would be no ambiguity between these strings and valid IDNs. The attached ascii-surrogateescape.diff does this.

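The decoding scheme described above can be sketched in pure Python. This is only an illustration of the idea, not the code in ascii-surrogateescape.diff (which changes the C implementation of the socket module); the helper names decode_hostname and is_valid_idna are made up for the example.

```python
def decode_hostname(raw: bytes) -> str:
    """Hypothetical helper: decode a resolver result as ASCII, escaping
    each non-ASCII byte as a lone surrogate (the PEP 383 mechanism)."""
    return raw.decode("ascii", "surrogateescape")


def is_valid_idna(name: str) -> bool:
    """Hypothetical helper: report whether the built-in IDNA codec
    accepts the name."""
    try:
        name.encode("idna")
    except UnicodeError:
        return False
    return True


raw = "\u20ac".encode("utf-8")   # the UTF-8 bytes of the euro sign entry
name = decode_hostname(raw)      # three lone surrogates, one per byte

# The escaped string gives back the original bytes ...
assert name.encode("ascii", "surrogateescape") == raw
# ... and cannot be mistaken for an IDN, because the IDNA codec rejects it.
assert not is_valid_idna(name)
# By contrast, the UTF-8-decoded form is silently mapped to an ACE label.
assert "\u20ac".encode("idna") == b"xn--lzg"
```

Pure ASCII names are unaffected by this decoding, so the scheme only changes what is returned for names that contain non-ASCII bytes.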
The returned strings could then be made to round-trip as arguments, by having functions that take hostname arguments attempt to encode them using ASCII/surrogateescape first before trying IDNA encoding. Since IDNA leaves ASCII names unchanged and surrogate characters are not allowed in IDNs, this would not change the interpretation of any string hostnames that are currently accepted. For ASCII names only, it would remove the 63-octet limit on label length that the IDNA encoding currently imposes, but since that limit exists only because of the 63-octet limit of the DNS, and non-IDN names may be intended for other resolution mechanisms, I think this is a feature, not a bug :)

The patch try-surrogateescape-first.diff implements the above for all relevant interfaces, including gethostbyaddr() and getnameinfo(), which do currently accept hostnames, even if the documentation is vague (in the standard library, socket.getfqdn() calls gethostbyaddr() with a hostname, and the "os" module docs suggest calling socket.gethostbyaddr(socket.gethostname()) to get the fully-qualified hostname).

The patch still allows hostnames to be passed as bytes objects, but to simplify the implementation, it removes support for bytearray (as has been done for pathnames in 3.2). Bytearrays are currently only accepted by the socket object methods (.connect(), etc.), and this is undocumented and perhaps unintentional - the get*() functions have never accepted them.

One problem with the surrogateescape scheme would be with existing code that looks up an address and then tries to write the hostname to a log file or use it as part of the wire protocol, since the surrogate characters would fail to encode as ASCII or UTF-8, but the code would appear to work normally until it encountered a non-ASCII hostname, allowing the problem to go undetected. On the other hand, such code is probably broken as things stand, given that the address lookup functions can raise UnicodeError in the same situation, and this is not documented.
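A rough Python rendering of the "try ASCII/surrogateescape first" rule for hostname arguments might look as follows. The function name encode_hostname_arg is hypothetical, and try-surrogateescape-first.diff itself works on the C side of the socket module, so treat this purely as a sketch of the ordering and of its consequences.

```python
def encode_hostname_arg(host) -> bytes:
    """Hypothetical helper: turn a hostname argument into the bytes
    that would be handed to the resolver."""
    if isinstance(host, bytes):
        return host  # bytes objects are passed through untouched
    try:
        # ASCII names and surrogate-escaped names take this path.
        return host.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Genuine non-ASCII text is still treated as an IDN.
        return host.encode("idna")


raw = b"\xe2\x82\xac"                            # non-ASCII bytes from a lookup
escaped = raw.decode("ascii", "surrogateescape")

assert encode_hostname_arg(escaped) == raw            # round-trips as bytes
assert encode_hostname_arg("\u20ac") == b"xn--lzg"    # real IDN: unchanged behaviour
assert encode_hostname_arg("example.org") == b"example.org"

# An ASCII label longer than 63 octets is accepted on the first path,
# whereas the IDNA codec on its own refuses it.
long_label = "a" * 64
assert encode_hostname_arg(long_label) == long_label.encode("ascii")

# The hazard for logging or wire-protocol code: the escaped name cannot
# be encoded as UTF-8 (or ASCII) with the default strict error handler.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError:
    strict_utf8_fails = True
else:
    strict_utf8_fails = False
assert strict_utf8_fails
```

Code that needs to log such a name could encode it with an explicit error handler such as "backslashreplace", but that is a choice for the calling code rather than part of the proposal.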
Also, protocol definitions often specify some variant of the RFC 1123 syntax for hostnames (thus making non-ASCII bytes illegal), so code that checked for this prior to encoding the name would probably be OK, but it's more likely the exception than the rule.

An alternative approach might be to return all hostnames as bytes objects, thus breaking everything immediately and obviously...

[1] http://tools.ietf.org/html/rfc3491#section-5
[2] http://tools.ietf.org/html/rfc3454#appendix-C.5

----------
components: Extension Modules
files: ascii-surrogateescape.diff
keywords: patch
messages: 111550
nosy: baikie
priority: normal
severity: normal
status: open
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file18195/ascii-surrogateescape.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9377>
_______________________________________
_______________________________________________
Python-bugs-list mailing list