[android-developers] Re: SAXParser throws exception for bad character in CDATA block, bug???

Bob Kerns Thu, 21 Apr 2011 17:30:56 -0700

To extend what Jens said -- and sorry for this sounding like a rant, but I 
think this is a very important point that is often misunderstood, and often 
causes a LOT of harm. Don't worry, I'm not ranting at you or any person in 
particular.


The UTF-8 encoding of the character in question is 0xE2 0x80 0x99.

0xE2 is â <http://en.wikipedia.org/wiki/%C3%82> -- a legitimate character in 
ISO-8859-1.

0x80 and 0x99 are both nonsense in ISO-8859-1: they fall in the C1 control 
range (0x80-0x9F), not among its printable characters. As Jens points out, 
CDATA is not "binary data". If you want to encode binary data, convert it to 
base64 or something similar, and dump it into a CDATA section.
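
To make the byte arithmetic concrete, here is a minimal sketch in plain Java. 
The class and method names are mine, purely for illustration. It encodes 
U+2019 (the character whose UTF-8 form is 0xE2 0x80 0x99) and then misreads 
those bytes as ISO-8859-1, which is what a reader trusting an ISO-8859-1 
declaration would do.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Render bytes as "0xE2 0x80 0x99" style hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misreadAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // RIGHT SINGLE QUOTATION MARK
        System.out.println(hex(quote.getBytes(StandardCharsets.UTF_8))); // 0xE2 0x80 0x99

        // One character in, three characters out: U+00E2 plus two C1 controls.
        for (char c : misreadAsLatin1(quote).toCharArray()) {
            System.out.printf("U+%04X isISOControl=%b%n", (int) c, Character.isISOControl(c));
        }
    }
}
```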

Frankly, I think you have to be nuts to be sending ISO-8859-1 XML. Every 
single time I've encountered ISO-8859-1 XML being produced, I've regarded it 
as a bug, and fixed it. There is absolutely no reason, ever, to produce such 
a thing. You're doing it wrong if you write it, and you're doing it wrong if 
that's all you can read. UTF-8 is always accepted by a conforming reader. 
ISO-8859-1 is not required to be supported by any conforming reader -- and 
never has been.

Clearly, here, someone wanted to include a character not in ISO-8859-1 -- as 
they should have every expectation of being able to do. But whoever was 
doing the actual output is doing some sleight-of-hand, fake-XML stuff, 
rather than using an actual XML library. Any reasonable XML library is going 
to refuse to write that character (by throwing an exception) if told to 
write it to an ISO-8859-1 document.
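
The "refuse to write" behavior can be sketched with the JDK's own charset 
machinery. This is not the code of any particular XML library, just an 
illustration of the principle; the class and method names are hypothetical. 
A CharsetEncoder set to REPORT throws instead of substituting.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1Writer {
    // Encode to ISO-8859-1, throwing if any character has no mapping.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    // Convenience wrapper: true if the text can be written without loss.
    static boolean canWrite(String text) {
        try {
            encodeStrict(text);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canWrite("caf\u00E9"));  // e-acute is in ISO-8859-1
        System.out.println(canWrite("it\u2019s"));  // U+2019 is not
    }
}
```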

I counsel people to NEVER hand-construct XML -- always use an API that is 
incapable of producing malformed XML. Not only is it quite difficult to get 
right, and always get it right, when constructing by hand -- the practice 
simply does not scale. The more XML you hand-produce, the more certain it 
becomes that somewhere, somehow, you will have some edge case you failed to 
consider, and it becomes a virtual certainty that you will introduce a bug 
(often, this very sort of bug).
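
As a rough sketch of what "use an API" means in practice, here is one way to 
do it with the DOM and Transformer classes in javax.xml. The element names 
"item" and "title" and the class name are assumptions for illustration only. 
The library does the escaping and the encoding, so the caller never touches 
angle brackets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class FeedItemWriter {
    // Build <item><title>...</title></item> and serialize it as UTF-8.
    static byte[] writeItem(String title) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element item = doc.createElement("item");
        Element titleEl = doc.createElement("title");
        titleEl.appendChild(doc.createTextNode(title)); // escaping is the library's job
        item.appendChild(titleEl);
        doc.appendChild(item);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Parse the bytes back with an XML parser and return the text content.
    static String readTitle(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement().getTextContent();
    }

    // True if the text survives a write/parse round trip unchanged.
    static boolean roundTrips(String title) {
        try {
            return title.equals(readTitle(writeItem(title)));
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String tricky = "Tom & Jerry <3 it\u2019s";
        System.out.println(new String(writeItem(tricky), "UTF-8"));
        System.out.println(roundTrips(tricky));
    }
}
```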

Bottom line -- it is wrong to write ISO-8859-1 XML without the consent of 
the recipient (which is never possible in an RSS feed); it is wrong to do so 
because it will fail -- somewhere, sometime, someone is sure to use a 
non-ISO-8859-1 character; it is wrong to send non-text (i.e. bytes that are 
invalid in the chosen encoding) in CDATA; and it is wrong to stick UTF-8 
bytes in a CDATA section in an ISO-8859-1 document.

So I have a serious problem with Jens' solution, and you should think very 
carefully about it before adopting it, especially applying it to other 
contexts where this objection may be more serious.

Jens' "solution" really amounts to taking broken data and silently 
corrupting it further. I have seen entire databases mangled by applying this 
philosophy -- there was a bug in the system, and it used some library (often 
the programmer was unaware of this behavior) that took anything it didn't 
understand and turned it into '?' or some other fixed character, or, as 
here, dropped it entirely.
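
Java's own String.getBytes(Charset) is a handy example of this kind of 
library behavior: it is documented to replace unmappable characters with the 
charset's replacement bytes rather than fail. A small sketch (the class name 
is mine) showing that two different inputs become indistinguishable 
afterwards:

```java
import java.nio.charset.StandardCharsets;

class SilentCorruptionDemo {
    // Encode to ISO-8859-1 and decode again. Nothing here ever throws;
    // unmappable characters are quietly replaced with '?'.
    static String lossyRoundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String right = "it\u2019s"; // right single quotation mark
        String left = "it\u2018s";  // left single quotation mark
        System.out.println(lossyRoundTrip(right)); // it?s
        System.out.println(lossyRoundTrip(left));  // it?s
        // Both originals collapse to the same string, so no repair tool
        // can tell afterwards which character was there.
    }
}
```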

The world is filled with databases and files which have had this sort of 
data corruption applied, and once it is done, there is no hope whatsoever of 
repair.

It is, in most cases, far better to fail instantly and irrevocably. Don't 
let that bad data get any further.

If you're reading ISO-8859-1, and you find something that's not one of the 
191 characters in ISO-8859-1, you have data corruption. Stop there.
If you're reading UTF-8, and you find something that's not properly-encoded 
UTF-8, again, you have data corruption. Stop before it gets worse.
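
A minimal sketch of "stop there" for the UTF-8 case, using the JDK's 
CharsetDecoder with REPORT so that malformed input raises an exception 
instead of being replaced. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Reader {
    // Decode UTF-8, throwing on the first malformed byte sequence.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience wrapper: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xE2, (byte) 0x80, (byte) 0x99};  // U+2019
        byte[] truncated = {(byte) 0xE2, (byte) 0x80};          // sequence cut short
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9, 's'};      // ISO-8859-1 bytes, not UTF-8

        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(truncated));
        System.out.println(isValidUtf8(latin1));

        // For contrast, the lenient constructor never throws; it substitutes
        // U+FFFD and carries on.
        System.out.println(new String(latin1, StandardCharsets.UTF_8).contains("\uFFFD"));
    }
}
```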

Whether that applies here depends on what you're planning to do with the 
data you're collecting. If it's ephemeral, Jens' solution may be appropriate 
-- the cost of propagating corrupt data may be lower than the cost of hiding 
whatever information the user could still glean from the rubbish.

But if you're persisting the data, or the data has any serious consequence 
-- think very carefully before taking any action other than throwing an 
error.

That's what the SAX parser is doing here, and it is entirely correct in 
taking this stance.

Always use UTF-8.
Always use an XML library for both input and output.

This really is a place where it pays to be dogmatic.
