I'll note that U+FFFF is a Unicode noncharacter (a property of the code point, not of the UTF-8 encoding), and "these noncharacters should never be included in text interchange between implementations." [1] I assume the OCR engine may be emitting U+FFFF when it can't recognize a character? So it's not wrong for a parser to complain (or not complain) about 0xFFFF, and you can just scrub the string like Jon suggests.
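
For what it's worth, here is a rough sketch of that kind of scrubbing. It is only one possible approach: it uses Python 3 syntax (the code quoted in this thread is Python 2) and drops any character outside the XML 1.0 Char production, which excludes U+FFFE and U+FFFF. The names scrub_xml_text, build_page and _XML_ILLEGAL are made up for illustration and are not part of minidom or lxml.

```python
import re
from xml.dom.minidom import Document, parseString

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (hypothetical helper name, not a library constant)
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement=""):
    """Return text with every character XML 1.0 disallows replaced."""
    return _XML_ILLEGAL.sub(replacement, text)


def build_page(words):
    """Build a <Page> of <String CONTENT="..."/> elements, scrubbing each word."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    for content in words:
        word = doc.createElement("String")
        # Scrub before the value reaches the DOM; minidom will not do it for us.
        word.setAttribute("CONTENT", scrub_xml_text(content))
        page.appendChild(word)
    return doc.toprettyxml(indent=" ", encoding="utf-8")


# Assumed sample input: an OCR word containing the U+FFFF noncharacter.
xml_bytes = build_page(["Buffalo\uffffLaunch", "Club"])

# Round-trip through a parser to confirm the output is well-formed.
parsed = parseString(xml_bytes)
first = parsed.documentElement.getElementsByTagName("String")[0]
print(first.getAttribute("CONTENT"))
```

Scrubbing at the point where the value is set keeps the fix on the writing side, so whatever reads the file later never sees the noncharacter. If losing the position of an unrecognized character matters, passing a visible replacement such as "?" instead of the empty string may be preferable.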
Chris

[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

On 5 Mar, 2013, at 9:16, Jon Stroop <jstr...@princeton.edu> wrote:

> Mike,
> I haven't used minidom extensively but my guess is that
> doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the
> encoding because it can't parse the string in your content variable. I'm
> surprised that you're not getting tossed a UnicodeError, but The docs for
> Node.toxml() [1] might shed some light:
>
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the
>> encoding argument should be specified as “utf-8”.
>
> So what happens if you're not explicit about the encoding, i.e. just
> doc.toprettyxml()? This would hopefully at least move your exception to a
> more appropriate place.
>
> In any case, one solution would be to scrub the string in your content
> variable to get rid of the invalid characters (hopefully they're
> insignificant). Maybe something like this:
>
> def unicode_filter(char):
>     try:
>         unicode(char, encoding='utf-8', errors='strict')
>         return char
>     except UnicodeDecodeError:
>         return ''
>
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content
>
> Not really my area of expertise, but maybe worth a shot....
> -Jon
>
> 1.
> http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml
>
> --
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
> jstr...@princeton.edu
>
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the ocr data found in a pdf and places
>> it in a custom xml file.
>>
>> I use Python scripts to create the xml file. Something like this (trimmed
>> down a bit):
>>
>> from xml.dom.minidom import Document
>> doc = Document()
>> Page = doc.createElement("Page")
>> doc.appendChild(Page)
>> f = StringIO(txt)
>> lines = f.readlines()
>> for line in lines:
>>     word = doc.createElement("String")
>>     ...
>>     word.setAttribute("CONTENT",content)
>>     Page.appendChild(word)
>> return doc.toprettyxml(indent=" ",encoding="utf-8")
>>
>>
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296" WIDTH="2609">
>>  <String CONTENT="BuffaloLaunch" />
>>  <String CONTENT="Club" />
>>  <String CONTENT="Offices" />
>>  <String CONTENT="Installed" />
>>  ...
>> </Page>
>>
>> I am able to get this document to be created ok and saved to an xml file.
>> The problem occurs when I try and have it read using the lxml library:
>>
>> from lxml import etree
>> doc = etree.parse(filename)
>>
>>
>> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of allowed
>> range, line 94, column 19". Which when I look at the file, is true. There is
>> a 0XFFFF character in the content field.
>>
>> How is a file able to be created using minidom (which I assume would create
>> a valid xml file) and then failing when parsing with lxml? What should I do
>> to fix this on the encoding side so that errors don't show up on the parsing
>> side?
>> Thanks,
>> Mike
>>
>> How is the
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> 518.327.6376
>> mbecca...@paulsmiths.edu
>> Become a friend of Paul Smith's Library on Facebook today!