Re: [CODE4LIB] XML Parsing and Python

Chris Beer Tue, 05 Mar 2013 10:49:25 -0800

I'll note that 0xFFFF (U+FFFF) is a Unicode noncharacter, and "these noncharacters 
should never be included in text interchange between implementations." [1] I 
assume the OCR engine may be emitting 0xFFFF when it can't recognize a character? 
So, it's not wrong for a parser to complain (or, not complain) about 0xFFFF, 
and you can just scrub the string like Jon suggests.
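
For what it's worth, XML 1.0 leaves U+FFFE and U+FFFF out of its Char production, which is presumably why lxml rejects the file. Here is a minimal sketch of the scrubbing step, written for Python 3 rather than the Python 2 used elsewhere in this thread. The names scrub_xml_text and build_page are made up for illustration, and the round-trip check uses the standard library's minidom parser instead of lxml so the snippet stands alone:

```python
import re
from xml.dom.minidom import Document, parseString

# Matches any code point outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement=""):
    """Replace characters that XML 1.0 does not allow (hypothetical helper)."""
    return _XML_ILLEGAL.sub(replacement, text)


def build_page(words):
    """Build a <Page> of <String CONTENT="..."/> elements, scrubbing each word."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    for content in words:
        word = doc.createElement("String")
        # minidom does not validate attribute values, so scrub before setting.
        word.setAttribute("CONTENT", scrub_xml_text(content))
        page.appendChild(word)
    # With an encoding argument, toprettyxml returns bytes in Python 3.
    return doc.toprettyxml(indent="  ", encoding="utf-8")


xml_bytes = build_page(["Buffalo\uffffLaunch", "Club"])
parseString(xml_bytes)  # the scrubbed output parses without a syntax error
```

Passing a visible replacement such as "?" instead of the empty string keeps a marker wherever the OCR engine gave up, which may be preferable if those positions matter later.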


Chris


[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

On 5 Mar, 2013, at 9:16 , Jon Stroop <[email protected]> wrote:

> Mike,
> I haven't used minidom extensively but my guess is that 
> doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the 
> encoding because it can't parse the string in your content variable. I'm 
> surprised that you're not getting tossed a UnicodeError, but the docs for 
> Node.toxml() [1] might shed some light:
> 
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the 
>> encoding argument should be specified as “utf-8”.
> 
> So what happens if you're not explicit about the encoding, i.e. just 
> doc.toprettyxml()? This would hopefully at least move your exception to a 
> more appropriate place.
> 
> In any case, one solution would be to scrub the string in your content 
> variable to get rid of the invalid characters (hopefully they're 
> insignificant). Maybe something like this:
> 
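> # Python 2: content is a byte string, so this checks one byte at a time.
> # Any byte that is not valid UTF-8 by itself (every non-ASCII byte) is dropped.
> def unicode_filter(char):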
>    try:
>        unicode(char, encoding='utf-8', errors='strict')
>        return char
>    except UnicodeDecodeError:
>        return ''
> 
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content
> 
> Not really my area of expertise, but maybe worth a shot....
> -Jon
> 
> 1. 
> http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml
> 
> -- 
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
> [email protected]
> 
> 
> 
> 
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the ocr data found in a pdf and places 
>> it in a custom xml file.
>> 
>> I use Python scripts to create the xml file. Something like this (trimmed 
>> down a bit):
>> 
>> from xml.dom.minidom import Document
>> doc = Document()
>>      Page = doc.createElement("Page")
>>      doc.appendChild(Page)
>>      f = StringIO(txt)
>>      lines = f.readlines()
>>      for line in lines:
>>              word = doc.createElement("String")
>>              ...
>>              word.setAttribute("CONTENT",content)
>>              Page.appendChild(word)
>>      return doc.toprettyxml(indent="  ",encoding="utf-8")    
>> 
>> 
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296" WIDTH="2609">
>>   <String CONTENT="BuffaloLaunch" />
>>   <String CONTENT="Club" />
>>   <String CONTENT="Offices" />
>>   <String CONTENT="Installed" />
>>   ...
>> </Page>
>> 
>> I can create this document and save it to an XML file without trouble. 
>> The problem occurs when I try to read it with the lxml library:
>> 
>> from lxml import etree
>> doc = etree.parse(filename)
>> 
>> 
>> I am running into errors like "XMLSyntaxError: Char 0xFFFF out of allowed 
>> range, line 94, column 19", which, when I look at the file, is accurate: 
>> there is a 0xFFFF character in the content field.
>> 
>> How can a file be created with minidom (which I assumed would produce a 
>> valid XML file) and then fail when parsed with lxml? What should I do to 
>> fix this on the encoding side so that errors don't show up on the parsing 
>> side?
>> Thanks,
>> Mike
>> 
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> 518.327.6376
>> [email protected]
>> Become a friend of Paul Smith's Library on Facebook today!
