[
https://issues.apache.org/jira/browse/DAFFODIL-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555128#comment-17555128
]
Mike Beckerle edited comment on DAFFODIL-2346 at 6/16/22 3:55 PM:
------------------------------------------------------------------
Update: This ticket is pretty outdated. We have learned much about the
limitations of XML as a data language since then.
This comment is to clarify some things about space preservation. It is kind of
an essay on using XML as a data language.
The xml:space='preserve' attribute is about *whitespace-only* nodes in the
text. The XML spec speaks of whitespace used "to set apart the markup", and
[this site|http://www.xmlplease.com/xml/xmlspace/] clarifies that the spec means
not whitespace generally, but only whitespace between XML elements.
So to clarify: xml:space='preserve' allows conversion of this:
{code:java}
<x xml:space='preserve'>
this is <y> some data </y>
<z> and more </z>
</x>
{code}
into
{code:java}
<x xml:space='preserve'>this is <y> some data </y>
<z> and more </z>
</x>
{code}
Note that the whitespace-only line ending between the `</y>` end tag and the
`<z>` tag is preserved, as is the whitespace-only line ending between the
`</z>` and the `</x>`. But the whitespace before the word "this" at the start of
the content of element x is NOT (necessarily) preserved. That's because it is
not a whitespace-only node. It is part of the text node that is the first child
node of element x.
The default whitespace policy would let this whole thing be recast on a single
line of text.
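To make the node distinction concrete, here is a minimal sketch that lists the children of element x from the example above and marks which text nodes are whitespace-only. It uses Python's xml.dom.minidom purely for illustration; the helper name describe_children is hypothetical and is not part of Daffodil.

```python
from xml.dom import minidom

SOURCE = (
    "<x xml:space='preserve'>\n"
    "this is <y> some data </y>\n"
    "<z> and more </z>\n"
    "</x>"
)

def describe_children(xml_text):
    """Return (kind, value) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    result = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # A text node is "whitespace-only" if nothing is left after stripping.
            kind = "whitespace-only text" if node.data.strip() == "" else "text"
            result.append((kind, node.data))
        else:
            result.append(("element", node.tagName))
    return result

for kind, value in describe_children(SOURCE):
    print(kind, repr(value))
```

The leading line ending before "this" shows up inside an ordinary text node, not as a whitespace-only node, which is why xml:space='preserve' does not necessarily protect it.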
Given this strange whitespace-only-node focus, the only thing xml:space is good
for is something like this:
{code:java}
<myHaiku xml:space='preserve'>
<line>Of all the gin joints</line>
<line>In all the towns of the world</line>
<line>She walked into mine</line>
</myHaiku>
{code}
In that case, the xml:space='preserve' tells an XML processor to preserve the
whitespace before, between, and after the line elements. Those are all
whitespace-only nodes. Note however, if this were deeply indented by an XML
pretty printer, it could turn this into:
{code:java}
<myHaiku xml:space='preserve'>
<line>
Of all the
gin joints
</line>
<line>
In all the towns
of the world
</line>
<line>
She walked into
mine
</line>
</myHaiku>
{code}
So the whitespace inside the line elements would not be preserved.
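Not every pretty printer wraps long lines, but even plain indentation changes text content. The sketch below uses Python's xml.dom.minidom only as a convenient stand-in for "some pretty printer"; the helper first_text is hypothetical. In this sketch the first text node of x changes after pretty printing even though xml:space='preserve' is present.

```python
from xml.dom import minidom

def first_text(xml_text):
    """Return the data of the first child node (a text node here) of the root."""
    root = minidom.parseString(xml_text).documentElement
    return root.firstChild.data

original = "<x xml:space='preserve'>this is <y> some data </y></x>"

# Re-serialize with indentation, as a pretty printer would.
pretty = minidom.parseString(original).documentElement.toprettyxml(indent="    ")

print(repr(first_text(original)))
print(repr(first_text(pretty)))
```

The second value carries extra line endings and indentation that were never in the data.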
The conclusion from all of this is that the xml:space attribute is just not
helpful for much of anything when using XML as a data language, because as a
data language we do NOT care about the whitespace between element tags. We ONLY
care about the whitespace within elements of simple text content, and xml:space
has nothing to do with that.
Hence, using xml:space='preserve' is not a relevant way to solve this problem.
First, it has nothing to do with whether CRLF and CR are converted to LF. XML
parsers do this regardless of xml:space='preserve'. Second, it has nothing to
do with whether XML processors will clobber whitespace characters when trying
to pretty print or wrap long lines.
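The line-ending point can be checked with a few lines of code. This is a minimal sketch using Python's xml.etree.ElementTree (an illustrative choice, not something Daffodil uses); parsed_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def parsed_text(xml_text):
    """Parse a one-element document and return its text content."""
    return ET.fromstring(xml_text).text

# Literal CRLF and CR in the document, with xml:space='preserve' present.
literal = "<x xml:space='preserve'>a\r\nb\rc</x>"

# The same characters written as numeric character entities.
escaped = "<x>a&#x0D;&#x0A;b&#x0D;c</x>"

print(repr(parsed_text(literal)))
print(repr(parsed_text(escaped)))
```

The literal line endings come back as LF, while the numeric character entities survive as written.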
There are only two ways to get XML text to preserve whitespace characters:
# replace all whitespace characters by their equivalent numeric character
entities. This includes spaces between words of ordinary text.
# encapsulate all characters, including whitespace, with CDATA bracketing.
Note that CDATA bracketing cannot contain the sequence "]]>", nor can it contain
characters that must be preserved using numeric character entities, like CR,
where &#x0D; must be used.
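The two options above can be sketched as follows. The function names escape_all_whitespace and cdata_wrap are hypothetical, this is not Daffodil's implementation, and the sketch only handles space, tab, LF, and CR plus the usual markup characters.

```python
import xml.etree.ElementTree as ET

def escape_all_whitespace(value):
    """Option 1: every whitespace character becomes a numeric character entity."""
    out = []
    for ch in value:
        if ch in " \t\n\r":
            out.append("&#x%02X;" % ord(ch))
        elif ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
    return "".join(out)

def cdata_wrap(value):
    """Option 2: CDATA bracketing, stepping outside CDATA where it cannot work."""
    # "]]>" cannot appear inside CDATA, so split it across two sections.
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    # A literal CR would be normalized to LF, so close the section,
    # emit the entity, and open a new section.
    safe = safe.replace("\r", "]]>&#x0D;<![CDATA[")
    return "<![CDATA[" + safe + "]]>"

def round_trip(encoded):
    """Parse the encoded value back out of a wrapper element."""
    return ET.fromstring("<x>" + encoded + "</x>").text

sample = "  some text with    four spaces\rin a row"
print(escape_all_whitespace(sample))
print(cdata_wrap(sample))
```

Both encodings parse back to the original string, including the CR and the runs of spaces.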
The best we can do is expect that short strings containing single spaces will
usually be preserved. This is a heuristic, but it is most likely what people
want. However, any whitespace character other than a single space BETWEEN words
of a short string (between only, not at the start or end of the string) should
be replaced by a numeric character entity.
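A minimal sketch of that heuristic, assuming "between words" means the characters on both sides of the space are non-whitespace. The name escape_heuristic is hypothetical, and no length threshold for "short" is applied here since the comment does not define one.

```python
import xml.etree.ElementTree as ET

def escape_heuristic(value):
    """Keep single spaces between words; escape all other whitespace."""
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        if ch == " ":
            # A "safe" space is interior and has non-whitespace on both sides.
            single_between_words = (
                0 < i < last
                and not value[i - 1].isspace()
                and not value[i + 1].isspace()
            )
            out.append(" " if single_between_words else "&#x20;")
        elif ch in "\t\n\r":
            out.append("&#x%02X;" % ord(ch))
        elif ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
    return "".join(out)

sample = "  some text with    four spaces\rin a row"
encoded = escape_heuristic(sample)
print(encoded)
print(ET.fromstring("<x>" + encoded + "</x>").text == sample)
```

The round trip holds only as long as nothing reformats the unescaped single spaces, which is exactly the risk the heuristic accepts.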
So consider this example where the line ending after the word 'spaces' is a CR:
{code:java}
<x>  some text with    four spaces
in a row</x>
{code}
If we escape everything, this must become
{code:java}
<x>&#x20;&#x20;some&#x20;text&#x20;with&#x20;&#x20;&#x20;&#x20;four&#x20;spaces&#x0D;in&#x20;a&#x20;row</x>
{code}
If we're willing to accept the single-space heuristic and its risk, this can be
simplified by using regular spaces wherever there is just one space between
words:
{code:java}
<x>&#x20;&#x20;some text with&#x20;&#x20;&#x20;&#x20;four spaces&#x0D;in a row</x>
{code}
or this:
{code:java}
<x><![CDATA[  some text with    four spaces]]>&#x0D;<![CDATA[in a row]]></x>
{code}
All of these are very ugly, but are the cost of preserving string values
perfectly when using XML as a data language, not the text markup language it
was originally intended for.
Long strings always need the escaping described above (either replace all
whitespace with entities, or use CDATA where possible).
A mode where the escaping is done for ALL strings, even short ones, should be
available as well, and perhaps should be the default, except for compatibility
with current infosets.
All of the above assumes that when comparing XML infosets, one uses a
type-aware comparison. Otherwise things like:
{code:java}
<x>
123.456
</x>
{code}
will not be considered equivalent to:
{code:java}
<x>123.456</x>
{code}
Without schemas, however, this can only be done either by heuristic (if it
looks like a number, let's compare it like one), or by adding xsi:type
annotation attributes:
{code:java}
<x xsi:type='xs:decimal'>
123.456
</x>
{code}
(Note there are JIRA tickets about adding xsi:type annotations as an option to
XML infosets: DAFFODIL-182 for infosets, and DAFFODIL-2402 for the TDML runner.)
Without that information, a schema, or heuristic guessing, one would have to
assume the values are different since their text is different.
Or... one could assume that whitespace is fungible, and compare after having
collapsed whitespace. This could give false positive comparisons, but may be
preferable.
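Both comparison strategies can be sketched briefly. This is an illustration only: the names typed_value and collapse are hypothetical, only xs:decimal is handled, and the xs prefix is assumed (not checked) to be bound to the XML Schema namespace.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

def typed_value(elem):
    """Type-aware value: use xsi:type when present, else the raw text."""
    text = elem.text or ""
    if elem.get(XSI_TYPE) == "xs:decimal":
        # Surrounding whitespace is not significant for a decimal.
        return Decimal(text.strip())
    return text

def collapse(text):
    """Whitespace-fungible value: trim, and squeeze runs to a single space."""
    return " ".join(text.split())

NS = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
a = ET.fromstring("<x %s xsi:type='xs:decimal'>\n123.456\n</x>" % NS)
b = ET.fromstring("<x %s xsi:type='xs:decimal'>123.456</x>" % NS)

print(a.text == b.text)                  # raw text comparison
print(typed_value(a) == typed_value(b))  # type-aware comparison
print(collapse("a  b") == collapse("a b"))
```

The last comparison is the false positive mentioned above: the two strings differ by a space, yet compare equal once whitespace is collapsed.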
> XML Output needs option to use <![CDATA[ ]]> around simple element values
> containing whitespace.
> ----------------------------------------------------------------------------------------------------
>
> Key: DAFFODIL-2346
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2346
> Project: Daffodil
> Issue Type: Bug
> Components: Back End
> Affects Versions: 2.6.0
> Reporter: Mike Beckerle
> Priority: Minor
>
> It is incredibly painful to take the XML output, pretty print it to make it
> readable, and find out that this has mangled the significant whitespace
> inside element values.
> In general, since whitespace within simple values is considered fungible in
> XML, we have to protect whitespace that is truly part of the DFDL infoset.
> I think CDATA bracketing is preferable to replacing whitespace characters
> with XML escaping like &#x20;