Re: suppressing bad characters in output PCDATA (converting JSON to XML)

Stefan Behnel Tue, 29 Nov 2011 06:37:31 -0800

Adam Funk, 29.11.2011 13:57:

On 2011-11-28, Stefan Behnel wrote:

Adam Funk, 25.11.2011 14:50:

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary)


If the name "big_json" is supposed to hint at a large set of data, you may
want to use something other than minidom. Take a look at the
xml.etree.cElementTree module instead, which is substantially more memory
efficient.


Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file.  The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code works
already, why change it?
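
For comparison, here is a minimal sketch of the same kind of recursion with
ElementTree. It is not the code from this thread: build_element() and the
'item' tag used for list entries are made-up names, and it assumes the dict
keys are already valid element names. It imports xml.etree.ElementTree,
since recent Python versions no longer ship the separate cElementTree module.

```python
import xml.etree.ElementTree as ET


def build_element(tag, value):
    """Recursively turn a decoded JSON value into an Element (hypothetical helper)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Assumption: every key is already a valid XML element name.
        for key, child in value.items():
            elem.append(build_element(key, child))
    elif isinstance(value, list):
        # Assumption: list entries get a fixed, made-up tag name.
        for child in value:
            elem.append(build_element('item', child))
    else:
        # Scalars become text content; None becomes an empty element.
        elem.text = u'' if value is None else str(value)
    return elem
```

ET.tostring(build_element('record', data)) then gives the serialised bytes
for one record.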

and I save the document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()


Same mistakes as above. Especially the double encoding is both unnecessary
and likely to fail. This is also most likely the source of your problems.
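
One way to encode exactly once, as a sketch rather than the only option:
let minidom produce the UTF-8 bytes and write them to a file opened in
binary mode. save_document() is a made-up name, and doc is assumed to be an
xml.dom.minidom.Document as in the code above.

```python
from xml.dom import minidom


def save_document(doc, path):
    """Serialise a minidom Document to UTF-8 once and write the raw bytes."""
    # toxml(encoding=...) returns an encoded byte string that already
    # carries the XML declaration, so the file must not encode again.
    data = doc.toxml(encoding='UTF-8')
    with open(path, 'wb') as xml_file:
        xml_file.write(data)
```

Note that this only fixes the encoding step; it does not remove characters
that are illegal in XML.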


Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).


You should read up on Unicode a bit.

I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31


This strongly hints at a broken encoding, which can easily be triggered by
your erroneous encode-and-encode cycles above.


No, I've checked the JSON input and those exact control characters are
there too.


Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.

I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from
it. If replacing is enough, take a look at string.maketrans() in the string
module and str.translate(), a method on strings. Or maybe just use a
regular expression that matches the offending control characters and
replace them with a space. Or whatever suits your data best.
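
A minimal sketch of the regular expression approach, assuming the text is
already a Unicode string. The character class is the complement of the Char
production in the XML 1.0 spec linked above; scrub_xml_text() and the
pattern name are made up for illustration. It is written for Python 3; on a
narrow Python 2 build the astral range would need different handling.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def scrub_xml_text(text, replacement=u' '):
    """Replace characters that are not allowed in XML 1.0 (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)
```

Call it on every string before it goes into a text node or attribute, e.g.
doc.createTextNode(scrub_xml_text(value)). Pass replacement=u'' to delete
the characters instead of replacing them with spaces.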

Also, the kind of problem you present here makes it pretty clear that you
are using Python 2.x. In Python 3, you'd get the appropriate exceptions
when trying to write binary data to a Unicode file.
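
A small illustration of that Python 3 behaviour, using an in-memory text
stream in place of a real file. The function name is made up.

```python
import io


def try_write_bytes_to_text_stream():
    """Return the exception raised when bytes are written to a text stream."""
    stream = io.StringIO()
    try:
        # Encoded data is bytes in Python 3; a text stream only accepts str.
        stream.write(u'caf\u00e9'.encode('utf-8'))
    except TypeError as exc:
        return exc
    return None
```

In Python 3 this returns a TypeError instead of silently mixing encoded and
decoded data.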


Sorry, I forgot to mention the version I'm using, which is "2.7.2+".


Yep, Py2 makes Unicode handling harder than it should be.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
