To extend what Jens said -- and sorry for this sounding like a rant, but I think this is a very important point that is often misunderstood, and often causes a LOT of harm. Don't worry, I'm not ranting at you or any person in particular.
The UTF-8 encoding of the character in question is 0xE2 0x80 0x99. 0xE2 is â <http://en.wikipedia.org/wiki/%C3%82> -- a legitimate character in ISO-8859-1. 0x80 and 0x99 are both nonsense: in ISO-8859-1 they fall in the C1 control range (0x80-0x9F), not among the printable characters.

As Jens points out, CDATA is not "binary data". If you want to encode binary data, convert it to base64 or something similar, and dump it into a CDATA section.

Frankly, I think you have to be nuts to be sending ISO-8859-1 XML. Every single time I've encountered ISO-8859-1 XML being produced, I've regarded it as a bug, and fixed it. There is absolutely no reason, ever, to produce such a thing. You're doing it wrong if you write it, and you're doing it wrong if that's all you can read. UTF-8 is always accepted by a conforming reader. ISO-8859-1 is not required to be supported by any conforming reader -- and never has been.

Clearly, here, someone wanted to include a character not in ISO-8859-1 -- as they should have every expectation of being able to do. But whoever was doing the actual output is doing some sleight-of-hand, fake-XML stuff, rather than using an actual XML library. Any reasonable XML library is going to refuse to write that character (by throwing an exception) if told to write it to an ISO-8859-1 document.

I counsel people to NEVER hand-construct XML -- always use an API that is incapable of producing malformed XML. Not only is it quite difficult to get right, every time, when constructing by hand -- the practice simply does not scale. The more XML you hand-produce, the more certain it becomes that somewhere, somehow, you will hit an edge case you failed to consider and introduce a bug (often, this very sort of bug).

Bottom line -- it is wrong to write ISO-8859-1 XML without the consent of the recipient (which is never possible in an RSS feed); it is wrong to do so because it will fail -- somewhere, sometime, someone is sure to use a non-ISO-8859-1 character; it is wrong to send non-text (i.e. bytes that are invalid in the chosen encoding) in CDATA; and it is wrong to stick UTF-8 bytes in a CDATA section in an ISO-8859-1 document.

So I have a serious problem with Jens' solution, and you should think very carefully about it before adopting it, especially before applying it to other contexts where this objection may be more serious.

Jens' "solution" really amounts to taking broken data and silently corrupting it further. I have seen entire databases mangled by applying this philosophy -- there was a bug in the system, and it used some library (often without the programmer being aware of this behavior) that took anything it didn't understand and turned it into '?' or some other fixed character, or, as here, dropped it entirely. The world is filled with databases and files which have had this sort of data corruption applied, and once it is done, there is no hope whatsoever of repair.

It is, in most cases, far better to fail instantly and irrevocably. Don't let that bad data get any further. If you're reading ISO-8859-1, and you find something that's not one of the 191 characters in ISO-8859-1, you have data corruption. Stop there. If you're reading UTF-8, and you find something that's not properly-encoded UTF-8, again, you have data corruption. Stop before it gets worse.

Whether that applies here depends on what you're planning to do with the data you're collecting. If it's ephemeral, Jens' solution may be appropriate -- the cost of propagating corrupt data may be less than the cost of obscuring whatever information the user could glean from the rubbish. But if you're persisting the data, or the data has any serious consequence -- think very carefully before taking any action other than throwing an error. That's what the SAX parser is doing here, and it is entirely correct in taking this stance.

Always use UTF-8. Always use an XML library for both input and output. This really is a place where it pays to be dogmatic.
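
To make the "fail, don't mangle" stance concrete, here is a minimal sketch in Python (chosen only for brevity; nothing in this thread uses it). The names CorruptTextError, read_utf8, read_latin1 and write_latin1 are hypothetical, made up for this illustration. One assumption to flag loudly: Python's own iso-8859-1 codec accepts all 256 byte values, so the C1 check below is something I added by hand, and it only covers the 0x80-0x9F range.

```python
RIGHT_QUOTE_UTF8 = b"\xe2\x80\x99"  # UTF-8 bytes of U+2019, the character in this thread


class CorruptTextError(ValueError):
    """Raised instead of replacing or dropping bytes that do not fit the encoding."""


def read_utf8(data: bytes) -> str:
    """Decode UTF-8 strictly; malformed input is an error, not a '?'."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise CorruptTextError(f"not valid UTF-8: {exc}") from exc


def read_latin1(data: bytes) -> str:
    """Decode ISO-8859-1, rejecting the C1 control range 0x80-0x9F.

    Python's codec maps every byte to a code point and never raises,
    so the corruption check has to be done explicitly (hand-added here).
    """
    text = data.decode("iso-8859-1")
    for offset, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            raise CorruptTextError(
                f"byte 0x{ord(ch):02X} at offset {offset} is a C1 control, "
                "not ISO-8859-1 text"
            )
    return text


def write_latin1(text: str) -> bytes:
    """Encode to ISO-8859-1 strictly; unrepresentable characters are an error."""
    try:
        return text.encode("iso-8859-1", errors="strict")
    except UnicodeEncodeError as exc:
        raise CorruptTextError(f"not representable in ISO-8859-1: {exc}") from exc


if __name__ == "__main__":
    # The same three bytes: fine as UTF-8, corruption as ISO-8859-1.
    print(repr(read_utf8(RIGHT_QUOTE_UTF8)))
    for label, attempt in (
        ("read as ISO-8859-1", lambda: read_latin1(RIGHT_QUOTE_UTF8)),
        ("write U+2019 as ISO-8859-1", lambda: write_latin1("\u2019")),
        ("read truncated UTF-8", lambda: read_utf8(b"\xe2\x80")),
    ):
        try:
            attempt()
        except CorruptTextError as err:
            print(f"{label}: refused ({err})")
```

The lossy alternative is a one-word change (errors="replace" or errors="ignore" in place of errors="strict"), which is exactly why it slips into systems unnoticed: the call succeeds, and the original character is gone for good.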
-- You received this message because you are subscribed to the Google Groups "Android Developers" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/android-developers?hl=en

