[jira] [Comment Edited] (DAFFODIL-2346) XML Output needs option to use <![CDATA[     ]]> around simple element values containing whitespace.

Mike Beckerle (Jira) Thu, 16 Jun 2022 08:56:06 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555128#comment-17555128
 ]


Mike Beckerle edited comment on DAFFODIL-2346 at 6/16/22 3:55 PM:
------------------------------------------------------------------

Update: This ticket is pretty outdated. We have learned much about the 
limitations of XML as a data language since then.

This comment is to clarify some things about space preservation. It is kind of 
an essay on using XML as a data language.

The xml:space='preserve' is about *whitespace-only* nodes in the text. The XML 
spec says "to set apart the markup", and [this site 
|http://www.xmlplease.com/xml/xmlspace/] clarifies that the spec means not 
whitespace generally, but only between XML elements.

So to clarify: xml:space='preserve' allows conversion of this:
{code:java}
<x xml:space='preserve'>
this is <y> some data </y>
<z> and more </z>
</x>
{code}
into
{code:java}
<x xml:space='preserve'>this is <y> some data </y>
<z> and more </z>
</x>
{code}
Note that the whitespace-only line ending between the `</y>` end tag and the 
`<z>` tag is preserved, as is the whitespace-only line ending between the 
`</z>` and the `</x>`. But the whitespace before the word "this" at the start 
of the content of element x is NOT (necessarily) preserved. That's because it 
is not a whitespace-only node. It is part of the text node that is the first 
child node of element x.

The default whitespace policy would let this whole thing be recast on a single 
line of text.

Given this strange whitespace-only-node focus, the only thing xml:space is good 
for is something like this:
{code:java}
<myHaiku xml:space='preserve'>
<line>Of all the gin joints</line>
<line>In all the towns of the world</line>
<line>She walked into mine</line>
</myHaiku>
{code}
In that case, the xml:space='preserve' tells an XML processor to preserve the 
whitespace before, between, and after the line elements. Those are all 
whitespace-only nodes. Note, however, that if this were deeply indented by an 
XML pretty printer, it could turn into:
{code:java}
<myHaiku xml:space='preserve'>
<line>
  Of all the
  gin joints
</line>
<line>
  In all the towns
  of the world
</line>
<line>
  She walked into
  mine
</line>
</myHaiku>
{code}
So the whitespace inside the line elements would not be preserved.

The conclusion from all of this is that the xml:space attribute is just not 
helpful for much of anything when using XML as a data language, because as a 
data language we do NOT care about the whitespace between element tags. We 
ONLY care about the whitespace within elements of simple text content, and 
xml:space has nothing to do with those.

Hence, using xml:space='preserve' is not a relevant way to solve this problem.

First, it has nothing to do with whether CRLF and CR are converted to LF. XML 
parsers do this regardless of xml:space='preserve'. Second, it has nothing to 
do with whether XML processors will clobber whitespace characters when trying 
to pretty print or wrap long lines.
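
The line-ending point can be checked with any conforming parser. The sketch below is only an illustration, assuming Python's standard xml.etree.ElementTree rather than anything Daffodil itself uses. It parses the same content with and without xml:space='preserve', and then with the CR written as a numeric character reference.

```python
import xml.etree.ElementTree as ET

def parsed_text(xml_bytes):
    """Return the text content of the root element after parsing."""
    return ET.fromstring(xml_bytes).text

# Literal CRLF and CR in the serialized document, without and with xml:space.
plain = parsed_text(b"<x>a\r\nb\rc</x>")
preserved = parsed_text(b"<x xml:space='preserve'>a\r\nb\rc</x>")

# The same kind of content, but with the CR written as a character reference.
escaped = parsed_text(b"<x>a&#x0D;b</x>")

print(repr(plain))      # line endings as the application sees them
print(repr(preserved))  # identical to the above; xml:space changes nothing here
print(repr(escaped))    # the CR survives only in this form
```

Both literal forms reach the application with LF only, so the attribute makes no difference to line-ending normalization.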

The only ways to get XML text to preserve whitespace characters are these:
 # replace all whitespace characters by their equivalent numeric character 
entities. This includes spaces between words of ordinary text.
 # encapsulate all characters, including whitespace, with CDATA bracketing.
Note that CDATA bracketing cannot contain the sequence "]]>", nor can it 
contain characters that must be preserved using numeric character entities, 
like CR, where &#x0D; must be used.
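
As a rough sketch of the two options above (not Daffodil's implementation; the function names escape_all_whitespace and cdata_wrap are made up for this illustration, and Python's standard library is assumed only as a convenient way to check the round trip):

```python
import re
import xml.etree.ElementTree as ET

# The characters XML treats as whitespace: space, tab, LF, CR.
_WS = re.compile(r"[ \t\n\r]")

def escape_markup(text):
    """Escape the characters that are markup-significant in element content."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_all_whitespace(text):
    """Option 1: every whitespace character becomes a numeric character entity."""
    return _WS.sub(lambda m: "&#x%02X;" % ord(m.group()), escape_markup(text))

def cdata_wrap(text):
    """Option 2: CDATA sections, with each CR emitted between sections as
    &#x0D; and any "]]>" split so it never appears inside a single section."""
    pieces = []
    for i, chunk in enumerate(text.split("\r")):
        if i > 0:
            pieces.append("&#x0D;")
        if chunk:
            safe = chunk.replace("]]>", "]]]]><![CDATA[>")
            pieces.append("<![CDATA[" + safe + "]]>")
    return "".join(pieces)

def round_trip(body):
    """Parse the escaped body inside an element and return the recovered string."""
    return ET.fromstring("<x>" + body + "</x>").text

value = "  some text with    four spaces\rin a row"
print(escape_all_whitespace(value))
print(cdata_wrap(value))
print(round_trip(escape_all_whitespace(value)) == value)
print(round_trip(cdata_wrap(value)) == value)
```

The CDATA variant has to leave the section for every CR, which is why it ends up as several sections rather than one.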

The best we can do is expect short strings containing single spaces to be 
preserved usually. This is a heuristic, but most likely is what people want. 
However, any whitespace character other than a single space between words of 
a short string (only between words, not at the start or end of the string) 
should be replaced by a numeric character entity. 
So consider this example where the line ending after the word 'spaces' is a CR:
{code:java}
<x>  some text with    four spaces
in a row</x>
{code}
If we escape everything, this must become
{code:java}
<x>&#x20;&#x20;some&#x20;text&#x20;with&#x20;&#x20;&#x20;&#x20;four&#x20;spaces&#x0D;in&#x20;a&#x20;row</x>
{code}
If we're willing to heuristically risk single spaces, then this can be 
simplified to using regular spaces when there are just individual ones between 
words:
{code:java}
<x>&#x20;&#x20;some text with &#x20;&#x20;&#x20;four spaces&#x0D;in a row</x>
{code}
or this:
{code:java}
<x><![CDATA[  some text with    four spaces]]>&#x0D;<![CDATA[in a row]]></x>
{code}
All of these are very ugly, but are the cost of preserving string values 
perfectly when using XML as a data language, not the text markup language it 
was originally intended for.
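
One possible reading of the single-space heuristic, written as a sketch: a space stays literal only when it directly follows a non-whitespace character and is not part of the trailing whitespace, and every other whitespace character is escaped. The function name escape_heuristic is hypothetical, and the rule is inferred from the example above rather than taken from any Daffodil code.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\n\r"
MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

def escape_heuristic(text):
    """Keep a space literal only when it directly follows a non-whitespace
    character and non-whitespace text still follows later in the string.
    All other whitespace becomes a numeric character entity."""
    # Index just past the last non-whitespace character; everything from
    # here on is trailing whitespace and is always escaped.
    last_word_end = len(text.rstrip(XML_WS))
    out = []
    for i, ch in enumerate(text):
        if ch == " " and 0 < i < last_word_end and text[i - 1] not in XML_WS:
            out.append(ch)
        elif ch in XML_WS:
            out.append("&#x%02X;" % ord(ch))
        else:
            out.append(MARKUP.get(ch, ch))
    return "".join(out)

value = "  some text with    four spaces\rin a row"
escaped = escape_heuristic(value)
print(escaped)
# The parsed value is unchanged, as long as nothing rewrites the literal spaces.
print(ET.fromstring("<x>" + escaped + "</x>").text == value)
```

For the example string this gives the same text as the shorter entity form shown above. The risk being accepted is that a tool which rewraps or collapses the remaining literal spaces will still change the value.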

Long strings need to always get the escaping per above (either replace all 
whitespace with entities, or use CDATA where possible).

A mode where the escaping is done for ALL strings, even short ones, should be 
available as well, and perhaps should be the default, except for compatibility 
with current infosets.

All of the above assumes that when comparing XML infosets, one uses a 
type-aware comparison. Otherwise things like:
{code:java}
<x>
  123.456
</x>
{code}
will not be considered equivalent to:
{code:java}
<x>123.456</x>
{code}
Without schemas, however, this can only be done either by heuristic (if it 
looks like a number, let's compare it like one), or by adding xsi:type 
annotation attributes:
{code:java}
<x xsi:type='xs:decimal'>
  123.456
</x>
{code}
(Note there are JIRA tickets about adding xsi:type annotations as an option to 
XML infosets: DAFFODIL-182 for infosets, and DAFFODIL-2402 for the TDML runner.)

Without that information, schema, or just heuristic guessing, one would have to 
assume the values are different since their text is different.

Or... one could assume that whitespace is fungible, and compare after having 
collapsed whitespace. This could give false positive comparisons, but may be 
preferable.
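
To make the comparison options concrete, here is a small sketch, again in Python and only an illustration. The helper names collapse and simple_values_equal are invented, only xs:decimal is handled, and the "xs" prefix is simply assumed to be bound to the XML Schema namespace instead of being resolved properly.

```python
import re
from decimal import Decimal
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes under their expanded names.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

def collapse(text):
    """Trim the ends and collapse internal whitespace runs to a single space."""
    return re.sub(r"[ \t\n\r]+", " ", text or "").strip(" ")

def simple_values_equal(a, b):
    """Compare by numeric value when both elements say xsi:type='xs:decimal'
    (prefix binding assumed, not checked); otherwise by collapsed text."""
    if a.get(XSI_TYPE) == "xs:decimal" and b.get(XSI_TYPE) == "xs:decimal":
        return Decimal(collapse(a.text)) == Decimal(collapse(b.text))
    return collapse(a.text) == collapse(b.text)

NS = ("xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' "
      "xmlns:xs='http://www.w3.org/2001/XMLSchema'")

indented = ET.fromstring("<x>\n  123.456\n</x>")
compact = ET.fromstring("<x>123.456</x>")
typed_a = ET.fromstring("<x %s xsi:type='xs:decimal'>\n  123.4560\n</x>" % NS)
typed_b = ET.fromstring("<x %s xsi:type='xs:decimal'>123.456</x>" % NS)
untyped_a = ET.fromstring("<x>123.4560</x>")
two_spaces = ET.fromstring("<x>a  b</x>")
one_space = ET.fromstring("<x>a b</x>")

print(indented.text == compact.text)               # raw text differs
print(simple_values_equal(indented, compact))      # equal once collapsed
print(simple_values_equal(typed_a, typed_b))       # equal as decimals
print(simple_values_equal(untyped_a, compact))     # untyped, so text decides
print(simple_values_equal(two_spaces, one_space))  # the false positive case
```

The last comparison is the kind of false positive mentioned above: two strings that really differ in their whitespace compare as equal once whitespace is treated as fungible.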


was (Author: mbeckerle):
Update: This ticket is pretty outdated. We have learned much about the 
limitations of XML as a data language since then.

This comment is to clarify some things about space preservation. It is kind of 
an essay on using XML as a data language.

The xml:space='preserve' is about *whitespace-only* nodes in the text. The XML 
spec says "to set apart the markup", and [this site 
|http://www.xmlplease.com/xml/xmlspace/] clarifies that the spec means not 
whitespace generally, but only between XML elements.

So to clarify: xml:space='preserve' allows conversion of this:
{code:java}
<x xml:space='preserve'>
this is <y> some data </y>
<z> and more </z>
</x>
{code}
into

{code:java}
<x xml:space='preserve'>this is <y> some data </y>
<z> and more </z>
</x>
{code}
Note that the whitespace-only line ending between the end `</y>` and the `<z>` 
tag is preserved, as is the whitespace-only line ending between the `</z>` and 
the `</x>`. But the whitespace before the word "this" at the start of the 
content of element x, is NOT (necessarily) preserved. That's because this is 
not a whitespace-only node. It is part of the text node that is the first child 
node of element x. 

The default whitespace policy would let this whole thing be recast on a single 
line of text. 

Given this strange whitespace-only-node focus, the only thing xml:space is good 
for is something like this:
{code:java}
<myHaiku xml:space='preserve'>
<line>Of all the gin joints</line>
<line>In all the towns of the world</line>
<line>She walked into mine</line>
</myHaiku>
{code}
In that case, the xml:space='preserve' tells an XML processor to preserve the 
whitespaces before, between, and after the line elements. Those are all 
whitespace-only nodes. Note however, if this was deeply indented by an XML 
pretty printer, it could turn this into:

{code:java}
<myHaiku xml:space='preserve'>
<line>
  Of all the
  gin joints
</line>
<line>
  In all the towns
  of the world
</line>
<line>
  She walked into
  mine
</line>
</myHaiku>
{code}

So the whitespace inside the line elements would not be preserved. 

The conclusion from all of this is that xml:space attribute is just not helpful 
for much of anything when using XML as a data language, because as a data 
language we do NOT care about the whitespace between element tags. We ONLY care 
about the whitespace within elements of simple text content, and xml:space has 
nothing to do with those. 

Hence, using xml:space='preserve' is not a relevant way to solve this problem. 

First, it has nothing to do with whether CRLF and CR are converted to LF. XML 
parsers do this regardless of xml:space='preserve'. Second it has nothing to do 
with whether XML processors will clobber whitespace characters when trying to 
pretty print or wrap long lines. 

The only way to get XML text to preserve whitespace characters is this: 
# replace all whitespace characters by their equivalent numeric character 
entities. This includes spaces between words of ordinary text. 
# encapsulate all characters, including whitespace, with CDATA bracketing. 
Note that CDATA bracketing cannot contain the sequence "]]>" nor can it contain 
characters that must be preserved using numeric character entities like CR 
where &#x0D; must be used. 

The best we can do is expect short strings containing single spaces to be 
preserved usually. This is a heuristic, but most likely is what people want. 
However any whitespace character other than a single space BETWEEN (only. Not 
at start or end of string.) words of short strings should be replaced by a 
numeric character entity. 
So consider this example where the line ending after the word 'spaces' is a CR:
{code:java}
<x>  some text with    four spaces
in a row</x>
{code}
If we escape everything, this must become

{code:java}
<x>&#x20;&#x20;some text with &#x20;&#x20;&#x20;four spaces&#x0D;in a row</x>
{code}
or this:

{code:java}
<x><![CDATA[  some text with    four spaces]]>&#x0D;<![CDATA[in a row]]></x>
{code}
Both are pretty ugly, but given the infrequency of CR in data, that particular 
ugliness is less likely to actually appear. 

Long strings need to always get the escaping per above (either replace all 
whitespace with entities, or use CDATA where possible).  

A mode where the escaping is done for ALL strings should be available as well. 

> XML Output needs option to use <![CDATA[     ]]> around simple element values 
> containing whitespace.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: DAFFODIL-2346
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2346
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End
>    Affects Versions: 2.6.0
>            Reporter: Mike Beckerle
>            Priority: Minor
>
> It is incredibly painful to take the XML output, pretty print it to make it 
> readable, and find out that this has mangled the significant whitespace 
> inside element values. 
> In general, since whitespace within simple values is considered fungible in 
> XML, we have to protect whitespace that is truly part of the DFDL infoset. 
> I think CDATA bracketing is preferable to replacing whitespace characters 
> with XML escaping like &#x20; 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
