Martin v. Löwis wrote: >> ci = codecs.lookup("xml-auto-detect") >> p = expat.ParserCreate() >> e = "utf-32" >> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e) >> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0] >> p.Parse(s, True) > > So how come the document being parsed is recognized as UTF-8?
Because you can force the encoder to use a specified encoding. If you do this and the unicode string starts with an XML declaration, the encoder will put the specified encoding into the declaration: import codecs e = codecs.getencoder("xml-auto-detect") print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>", encoding="utf-8")[0] This prints: <?xml version='1.0' encoding='utf-8'?><foo/> >> OK, so should I put the C code into a _xml module? > > I don't see the need for C code at all. Doing the bit fiddling for Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the right thing to do. Servus, Walter _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com