Mon, 8 Jan 2024 23:58:36 +0000, /Joseph Kesselman/:
> Please be careful to distinguish "Well-Formed" from "Valid"...
> A high-numeric-value surrogate pair shouldn't be a well-formedness
> issue, barring a bug.
For the convenience of others who might want it: legal characters in
XML 1.0 are defined at https://www.w3.org/TR/xml/#charsets
So the definition of a legal character is roughly:
any Unicode character, excluding the surrogate blocks, FFFE, and FFFF;
and the definition of a numeric character reference imposes the
following well-formedness constraint:
Characters referred to using character references _must_ match the
production for Char.
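
To make the constraint concrete, here is a minimal Python sketch of the
XML 1.0 Char production and of the check a character reference has to
pass. The function names are mine and purely illustrative; this is not
how Xalan or any particular parser implements it.

```python
import re

def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF        # stops just before the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD      # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # non-BMP characters are legal
    )

# Decimal (&#NNN;) or hexadecimal (&#xHHH;) character reference.
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def char_ref_is_well_formed(ref: str) -> bool:
    """True if ref is a character reference whose target matches Char."""
    m = _CHAR_REF.fullmatch(ref)
    if m is None:
        return False
    cp = int(m.group(1), 16) if m.group(1) is not None else int(m.group(2))
    return is_xml10_char(cp)
```

So a reference to a non-BMP code point such as &#x1F600; passes, while
a reference to a lone surrogate such as &#xD83D; does not.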
As far as I can tell, Xalan produces UTF-8 documents that are not
well-formed when it serializes non-BMP characters.
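
For illustration only, a small Python sketch of what a serializer has
to do with a UTF-16 surrogate pair: combine the two halves into one
code point first, then emit either its UTF-8 bytes or a single
character reference. It assumes the failure mode is that the two
surrogates get written out separately, which I have not spelled out
above; the helper names are made up.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def char_ref(cp: int) -> str:
    """Format a code point as a hexadecimal character reference."""
    return f"&#x{cp:X};"

cp = combine_surrogates(0xD83D, 0xDE00)

# Acceptable outputs: one reference, or the character's UTF-8 bytes.
single_ref = char_ref(cp)
utf8_bytes = chr(cp).encode("utf-8")

# Not well-formed: each half referenced on its own, since neither
# surrogate matches the Char production.
split_refs = char_ref(0xD83D) + char_ref(0xDE00)
```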
> I believe XML 1.1 added the null
> character, originally not accepted.
XML 1.1 generally added all but the NUL character (so &#0; is still not
well-formed):
https://www.w3.org/TR/xml11/#charsets
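
As a rough sketch, the XML 1.1 Char production comes down to the
following check in Python (the function name is hypothetical):

```python
def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF            # starts at 1, so NUL stays excluded
        or 0xE000 <= cp <= 0xFFFD      # surrogates, FFFE and FFFF excluded
        or 0x10000 <= cp <= 0x10FFFF
    )
```

The range starts at 1 rather than 0, which is why a reference to NUL
is rejected in 1.1 as well.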
--
Stanimir