encoding="utf8" ignored when parsing XML

Skip Montanaro Tue, 27 Dec 2016 07:13:22 -0800

I am trying to parse some XML which doesn't specify an encoding (Python 2.7.12 
via Anaconda on RH Linux), so it barfs when it encounters non-ASCII data. No 
great surprise there, but I'm having trouble getting it to use another 
encoding. First, I tried specifying the encoding when opening the file:

import io
import xml.etree.ElementTree

f = io.open(fname, encoding="utf8")
root = xml.etree.ElementTree.parse(f).getroot()

but that had no effect. Then, when calling xml.etree.ElementTree.parse I 
included an XMLParser object:

parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

Same-o, same-o:

unicode error 'ascii' codec can't encode characters in position 1113-1116: 
ordinal not in range(128)
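
One detail worth noting: the message says "can't encode", not "can't decode". That suggests text was being turned back into bytes somewhere. The sketch below is a guess at the mechanism, not a confirmed diagnosis. It assumes the parser wants raw bytes, and that a file opened with io.open(..., encoding=...) hands it already-decoded text instead. The sample document and the names data, message and name are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: UTF-8 bytes with no encoding declaration in the XML.
data = u"<root><name>caf\u00e9</name></root>".encode("utf-8")

# Encoding non-ASCII text with the ascii codec fails with the same
# kind of message as the one quoted above.
try:
    u"caf\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)

# Assumption: give the parser a binary stream and let it do the decoding.
parser = ET.XMLParser(encoding="utf-8")
root = ET.parse(io.BytesIO(data), parser=parser).getroot()
name = root.find("name").text
```

In the setup described here, that would mean trying io.open(fname, "rb") in place of io.open(fname, encoding="utf8"). Whether that clears the error on Python 2.7.12 is untested in this sketch.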

So, why does it continue to insist on using an ASCII codec? My locale's 
preferred encoding is:

>>> import locale
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'

which I presume is the official way to spell "ascii".
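
That presumption can be checked against Python's codec registry. A minimal sketch using only the standard codecs module; the variable name info is just for illustration:

```python
import codecs

# Look up the locale's encoding name in the codec registry and
# report the canonical codec name it resolves to.
info = codecs.lookup("ANSI_X3.4-1968")
print(info.name)
```

If the name resolves to the ascii codec, the locale setting is at least consistent with the 'ascii' in the error message.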

The chardetect command (part of the chardet package) tells me it looks like 
utf8 with high confidence:

% chardetect < ~/tmp/trash
<stdin>: utf-8 with confidence 0.99

I took a look at the code, and tracked the encoding I specified all the way 
down to the creation of the expat parser. What am I missing?

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list
