Eddie Shipman wrote:
> READ THE RFC. Anything that is not escaped should be in
> a CDATASection, PERIOD. That was the question asked.

Given the following CDATA section:

<![CDATA[&#13;&#10;]]>

That node contains 10 characters, none of which is a carriage return or
line feed. It is equivalent to this Delphi statement:

data := '&#13;&#10;';

A 10-character-long string.

But the data we're trying to get is only two characters, a carriage return
and a line feed. To get those two characters from the CDATA section, we'd
need to take the character data and send it _back_ through an XML
parser so that it treats that sequence of 10 characters as two numeric
character references.
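
To make that concrete, here is a minimal sketch. It uses Python's standard
xml.etree.ElementTree parser rather than any Delphi XML library, purely
because it is easy to run; the element names "data" and "wrap" are made up
for the illustration.

```python
import xml.etree.ElementTree as ET

# First pass: the parser hands back the CDATA content untouched.
doc = "<data><![CDATA[&#13;&#10;]]></data>"
text = ET.fromstring(doc).text
print(len(text), repr(text))        # 10 characters, no CR or LF among them

# Second pass: feed those 10 characters back in as ordinary markup so the
# parser resolves them as numeric character references.
decoded = ET.fromstring("<wrap>" + text + "</wrap>").text
print(len(decoded), repr(decoded))  # 2 characters: CR and LF
```

The second parse is the "send it back through" step; nothing in the first
parse will ever turn the CDATA content into a carriage return.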

A CDATA section is simply a way to avoid escaping lots of characters that
the XML interpreter would otherwise treat specially. The example I gave
above is *no different* from the following, which doesn't use a CDATA
section:

&amp;#13;&amp;#10;
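
A quick way to check that equivalence, again sketched with Python's
standard xml.etree.ElementTree parser (an assumption of convenience, not
the parser under discussion) and a made-up "data" element:

```python
import xml.etree.ElementTree as ET

# Same character data, written once with a CDATA section and once with
# the ampersands escaped by hand.
with_cdata = ET.fromstring("<data><![CDATA[&#13;&#10;]]></data>").text
without_cdata = ET.fromstring("<data>&amp;#13;&amp;#10;</data>").text

same = with_cdata == without_cdata
print(same, repr(with_cdata))  # both yield the 10-character string
```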

The issue is with _encoding_, not _escaping_.

The two characters in question have no special meaning in XML, so there's
no reason to escape them. There is no way _to_ escape them.

To output those characters, the XML serializer uses an identity
transformation and puts those two literal characters on the output stream.
What's desired, though, is for it to encode those characters as numeric
character references instead of as their literal values.

Suppose the output encoding is US-ASCII. The internal representation is of
course Unicode. The serializer will normally write a carriage return as
the one-byte hexadecimal sequence 0x0D when it needs to output that
character. It has no reason to encode it any differently because a
carriage return is a perfectly valid US-ASCII character and does not
interfere with XML syntax.
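
As an illustration of that default behaviour, here is a small sketch using
the serializer in Python's standard xml.etree.ElementTree. It is only a
stand-in for whichever serializer you are actually using, which may behave
differently; the "data" element is made up.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("data")
elem.text = "\r\n"  # two characters: carriage return, line feed

# Serialize element text to US-ASCII bytes.
out = ET.tostring(elem, encoding="us-ascii")
print(out)

# The carriage return goes out as the literal byte 0x0D, not as "&#13;".
has_literal_cr = b"\x0d" in out
has_char_ref = b"&#13;" in out
print(has_literal_cr, has_char_ref)
```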

If the character to output were U+2014, the em dash, then the serializer
would have to output the seven-byte character sequence "&#8212;" instead.
The em dash is not a valid US-ASCII character, so the serializer needs to
_encode_ that character some other way. There's nothing to escape, though,
since the em dash is not special in XML syntax. The serializer could _not_
use a CDATA section to contain the em dash because there is no way to
represent that character in US-ASCII without using special XML characters
to _encode_ it.
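
Here is a rough sketch of both halves of that point. It uses Python's
built-in "xmlcharrefreplace" error handler to stand in for what a
serializer does when a character doesn't fit the output encoding, and the
standard xml.etree.ElementTree parser to read the result back. The "data"
element is made up.

```python
import xml.etree.ElementTree as ET

em_dash = "\u2014"

# Encoding step: US-ASCII has no em dash, so fall back to a numeric
# character reference.
encoded = em_dash.encode("us-ascii", errors="xmlcharrefreplace")
print(encoded, len(encoded))  # the 7-byte reference

# As ordinary character data, the reference parses back to the em dash.
round_trip = ET.fromstring(b"<data>" + encoded + b"</data>").text

# Inside a CDATA section the same 7 bytes are just 7 literal characters.
in_cdata = ET.fromstring(b"<data><![CDATA[" + encoded + b"]]></data>").text
print(round_trip == em_dash, repr(in_cdata))
```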

If the character to output were U+0026, the ampersand, then the serializer
would output the five-byte character sequence "&amp;" in its place. The
ampersand is a valid US-ASCII character, but since it also has special
meaning in XML, the serializer needs to use some other way of encoding
that character instead of using the literal value. If the serializer
instead chose to use a CDATA section for that character, it could output
the 13-byte character sequence "<![CDATA[&]]>", but that would be
wasteful.
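
The two options can be compared side by side. This sketch uses
xml.sax.saxutils.escape from Python's standard library as a stand-in for
the serializer's escaping step; the element name "d" is made up.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

escaped = escape("&")    # the entity-reference form
cdata = "<![CDATA[&]]>"  # the CDATA form
print(len(escaped), len(cdata))  # 5 versus 13

# Both forms parse back to the same single character.
from_escaped = ET.fromstring("<d>" + escaped + "</d>").text
from_cdata = ET.fromstring("<d>" + cdata + "</d>").text
print(from_escaped == from_cdata == "&")
```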

-- 
Rob


_______________________________________________
Delphi mailing list -> [email protected]
http://www.elists.org/mailman/listinfo/delphi