Mon, 8 Jan 2024 23:58:36 +0000, /Joseph Kesselman/:

Please be careful to distinguish "Well-Formed" from "Valid"...

A high-numeric-value surrogate pair shouldn't be a well-formedness issue, barring a bug.

For the convenience of others who might want it: legal characters in XML 1.0 are defined at https://www.w3.org/TR/xml/#charsets

So the definition of a legal character is roughly:

    any Unicode character, excluding the surrogate blocks, FFFE, and FFFF;

and the definition of a numeric character reference imposes the following well-formedness constraint:

    Characters referred to using character references _must_ match the production for Char.
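
To make the constraint concrete, here is a rough sketch in Python of the XML 1.0 Char production and of a check for a single numeric character reference. The function names (is_xml10_char, char_ref_is_legal) are mine, purely for illustration; they are not part of any XML library.

```python
import re

def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # BMP below the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD       # BMP above them, minus FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )

# Matches one decimal (&#NNN;) or hexadecimal (&#xHHH;) character reference.
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def char_ref_is_legal(ref: str) -> bool:
    """True if ref is a character reference whose target matches Char."""
    m = _CHAR_REF.fullmatch(ref)
    if m is None:
        return False
    cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
    return is_xml10_char(cp)
```

With this, char_ref_is_legal("&#x1F600;") holds, while a reference to a lone surrogate such as "&#55357;" does not.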

As far as I understand it, Xalan produces UTF-8 documents that are not well-formed when it serializes non-BMP characters.
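
To illustrate why a surrogate pair written out as two separate units is a problem, here is a small Python sketch. I am not claiming this is exactly what Xalan emits; U+1F600 is just an arbitrary non-BMP example, and surrogate_pair and parses are hypothetical helper names. The parser used is the expat-based one from the Python standard library.

```python
import xml.etree.ElementTree as ET

def surrogate_pair(cp: int) -> tuple:
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def parses(doc: str) -> bool:
    """True if the standard-library parser accepts doc as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

cp = 0x1F600                      # arbitrary non-BMP character
high, low = surrogate_pair(cp)    # 0xD83D, 0xDE00

# One reference to the real code point versus one reference per surrogate.
good = f"<a>&#{cp};</a>"
bad = f"<a>&#{high};&#{low};</a>"

# The same contrast at the byte level: proper UTF-8 is a single 4-byte
# sequence; encoding each surrogate on its own gives two 3-byte sequences
# that a strict UTF-8 decoder rejects.
good_bytes = chr(cp).encode("utf-8")
bad_bytes = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
```

Here parses(good) is true and parses(bad) is false, and bad_bytes.decode("utf-8") raises UnicodeDecodeError.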

I believe XML 1.1 added the null character, originally not accepted.

XML 1.1 added the remaining control characters, all but NUL (so &#0; is still not well-formed):

https://www.w3.org/TR/xml11/#charsets
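
For comparison, a rough Python sketch of the XML 1.1 Char and RestrictedChar productions from that section. The function names are mine, for illustration only.

```python
def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF             # everything from U+0001 up, so no NUL
        or 0xE000 <= cp <= 0xFFFD       # still no surrogates, FFFE, or FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )

def is_xml11_restricted(cp: int) -> bool:
    """True if cp is a RestrictedChar, i.e. allowed in XML 1.1 only when
    written as a character reference, never literally."""
    return (
        0x1 <= cp <= 0x8
        or cp in (0xB, 0xC)
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
```

So is_xml11_char(0x1) holds while is_xml11_char(0x0) does not, and the surrogate blocks stay excluded just as in XML 1.0.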

--
Stanimir
