Hi,

I am trying to parse an xml file using the minidom parser.

<code>
from xml.dom import minidom

xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>

The parser is failing on this line:

<mrcb245-c>Heinrich Kèufner, Norbert Nedopil, Heinz Schèoch (Hrsg.).</mrcb245-c>

This is the error message I get:

Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that it is having an issue with the 'è' character. I have even tried the following to make sure it recognises the file as a UTF-8 file:

<code>
from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename,"r","utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>

However, this doesn't work either, and I get the following error message:

Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...

Can someone please point me in the right direction?

Thanks,
Ashmir
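
Both tracebacks are consistent with the file containing bytes that are not valid UTF-8. The sketch below shows one way this could be checked and worked around. It is written for Python 3 rather than the Python 2.5 shown above, and it assumes the file is really in a single-byte encoding such as Latin-1, which the post does not confirm. The helper names `find_bad_utf8` and `parse_with_fallback` and the sample bytes are made up for illustration.

```python
from xml.dom import minidom


def find_bad_utf8(data):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in `data`, or None if the bytes decode cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


def parse_with_fallback(data, fallback="latin-1"):
    """Parse XML bytes with minidom. If they are not valid UTF-8,
    decode them with `fallback` (an assumed encoding) and hand the
    parser a UTF-8 re-encoding instead."""
    if find_bad_utf8(data) is None:
        return minidom.parseString(data)
    text = data.decode(fallback)
    return minidom.parseString(text.encode("utf-8"))


# Hypothetical sample: 0xE8 is 'è' in Latin-1 but is not valid UTF-8 here.
sample = b"<r><mrcb245-c>Heinrich K\xe8ufner</mrcb245-c></r>"

print(find_bad_utf8(sample))
doc = parse_with_fallback(sample)
print(doc.getElementsByTagName("mrcb245-c")[0].firstChild.data)
```

Note that decoding as Latin-1 never raises an error, so a wrong guess at the encoding produces wrong characters rather than an exception. If the file has an XML declaration naming a different encoding, or comes from a system with its own character set, the fallback would need to match that instead.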