Darcy wrote:
> hi all, i have a newbie problem arising from writing-then-reading a
> unicode file, and i can't work out what syntax i need to read it in.
>
> the syntax i'm using now (just using quick hack tmp files):
> BEGIN
> f=codecs.open("tt.xml","r","utf8")
> fwrap=codecs.EncodedFile(f,"ascii","utf8")
> try:
>     ss=u''
>     ss=fwrap.read()
>     print ss
>     ## rrr=xml.dom.minidom.parseString(f.read()) # originally
> finally:
>     f.close()
> END
>
> barfs with this error:
> BEGIN
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
> position 5092: ordinal not in range(128)
> END
>
> any ideas?

You're doing things triple-time, which this time is not even half as good:

    f=codecs.open("tt.xml","r","utf8")

gives you a file that will return unicode objects when reading. And

    fwrap=codecs.EncodedFile(f,"ascii","utf8")

will wrap a normal, non-encoding-aware file to become an encoding-aware one.

The result is that reading from the former already yields a unicode object, which is then passed to the second wrapper. It will silently pass the unicode object through, but it's useless.

And then you try to pass that unicode object of yours to minidom. But guess what, the minidom parser expects a (byte) string, as it reads the XML encoding declaration (falling back to UTF-8 if there is none) and decodes the contents itself. So the passed unicode object is converted to a byte string beforehand, implicitly and with the default ascii codec, yielding the exception you see.

Just don't do any fancy encoding stuff at all; a simple

    rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list
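
P.S. In case it helps, here is a minimal sketch of the advice above: read the file as raw bytes and let minidom do the decoding. The sample document and the temporary path are made up for illustration (they stand in for your tt.xml, which I haven't seen), and the explicit encode at the end is just one way to get bytes back out without relying on the ascii default.

```python
import io
import os
import tempfile
import xml.dom.minidom

# Hypothetical sample document (not the original tt.xml): UTF-8 encoded XML
# containing U+2019, the character named in the traceback.
text = u'<?xml version="1.0" encoding="utf-8"?><doc>it\u2019s</doc>'

# Write the sample as bytes to a throwaway location.
path = os.path.join(tempfile.mkdtemp(), "tt.xml")
with io.open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read raw bytes, with no codecs wrappers involved.
with io.open(path, "rb") as f:
    raw = f.read()

# The parser decodes the bytes itself, based on the XML declaration.
dom = xml.dom.minidom.parseString(raw)

# Text nodes come back as unicode strings.
content = dom.documentElement.firstChild.data

# Encode explicitly wherever a byte string is needed (e.g. for output),
# instead of relying on an implicit conversion with the ascii codec.
encoded = content.encode("utf-8")
```

The point of the design is that decoding happens exactly once, inside the parser, and encoding happens exactly once, at the place where you choose the target encoding yourself.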