I didn't know this, so I imagine others might not. The string "&#0;" is invalid XML. The character is simply not allowed in XML in any representation. The XML 1.0 standard blocks most of the characters below x20, allowing only x9, xA and xD. XML 1.1 allows x1-x1F (the control characters among them only as character references), but still blocks x0.

http://www.w3.org/TR/1998/REC-xml-19980210#charsets
http://www.w3.org/TR/xml11/#charsets
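
For reference, here is a rough Java sketch of the Char productions from the two specs linked above. The class and method names (XmlChars, isXml10Char, isXml11Char, isXml10Text) are made up for illustration and are not part of Surefire or any library. The XML 1.1 check only tests whether a code point is allowed at all; it does not model the rule that most control characters must appear as character references.

```java
// Sketch only: XmlChars and its methods are hypothetical names, not Surefire code.
final class XmlChars {
    private XmlChars() {}

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the Char production of XML 1.1 (x0 is still excluded). */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if every code point of the string is legal in an XML 1.0 document. */
    static boolean isXml10Text(String s) {
        // codePoints() reports an unpaired surrogate as its own value, which the check rejects.
        return s.codePoints().allMatch(cp -> isXml10Char(cp));
    }
}
```

With this, isXml10Text("a\u0000b") and isXml10Text("a\u0007b") are both false, while isXml11Char(0x7) is true and isXml11Char(0x0) is false.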

This creates an interesting problem for serializing Java strings containing the null character, e.g. "\u0000", or for other non-whitespace control characters like the bell character "\u0007". We've got an integration test for this case in Surefire, and it does entirely the wrong thing (SUREFIRE-455).

In the patch submitted to that bug, Todor throws away nulls in his XML escaper, silently omitting them from the output; all other control characters (even the 1.0-illegal ones) pass through. That doesn't seem right, especially when we're talking about test results! (Expected "" but was "" ... Just imagine how painful it would be to track something like that down.)

But neither does it seem right to insert "&#0;" when it's illegal XML. Notably, Java will cheerfully print &#0; in XML if you tell it to do so, and many parsers will figure out what to do with it just fine; the same applies to "&#7;".

Thoughts? Should we emit "&#0;", standards be damned? Silently omit the character? Print a "?" instead? Something else?
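
To make the three options concrete, here is a minimal sketch of an escaper with a switchable policy. XmlEscaper, Policy and escape are hypothetical names for illustration; this is not Todor's patch and not the Surefire code. It assumes XML 1.0 rules and only deals with the control characters below x20, leaving other illegal code points (unpaired surrogates, xFFFE, xFFFF) alone.

```java
// Sketch only: XmlEscaper and Policy are hypothetical names, not Surefire code.
final class XmlEscaper {
    /** What to do with a control character that XML 1.0 forbids. */
    enum Policy { EMIT_REFERENCE, OMIT, QUESTION_MARK }

    private XmlEscaper() {}

    static String escape(String s, Policy policy) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                        appendIllegal(out, c, policy);
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    private static void appendIllegal(StringBuilder out, char c, Policy policy) {
        switch (policy) {
            case EMIT_REFERENCE:
                // Not well-formed XML 1.0, but keeps the information.
                out.append("&#").append((int) c).append(';');
                break;
            case QUESTION_MARK:
                out.append('?');
                break;
            case OMIT:
            default:
                // Drop the character silently.
                break;
        }
    }
}
```

For the input "a\u0000b\u0007c", EMIT_REFERENCE gives "a&#0;b&#7;c", OMIT gives "abc", and QUESTION_MARK gives "a?b?c". Only the first keeps enough information to tell the original characters apart, and it is the one that breaks the standard.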

-Dan
