[
https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814382#comment-16814382
]
Michael Beckerle commented on DAFFODIL-1559:
--------------------------------------------
Freestanding CR also needs to be preservable. Not just the CR of a CRLF pair.
Our current policy is described here:
[https://daffodil.apache.org/infoset/#xml-illegal-characters]
So we need an option to convert all CR to the code point #xE00D whether they
are isolated CR or part of CRLF pairs.
The code point #xE00D is in the Unicode Private Use Area (PUA), and XML
processing should then preserve it.
This needs to work for unparsing, with #xE00D turning into #xD (a CR character)
in the DFDL Infoset which is then unparsed as a regular #xD codepoint.
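The two directions described above can be sketched as a pair of string helpers. This is only an illustration of the mapping, not Daffodil's actual code: the class and method names (CrRemapSketch, crToPua, puaToCr) are hypothetical, and the sketch assumes the remapping is applied to whole string values at the XML boundary.

```java
final class CrRemapSketch {
    // U+E00D: the Private Use Area code point proposed above as the stand-in for CR.
    static final char PUA_CR = '\uE00D';
    static final char CR = '\r';

    // Parse direction (hypothetical helper): replace every CR, whether isolated
    // or part of a CRLF pair, before the value is written out as XML text.
    static String crToPua(String value) {
        return value.replace(CR, PUA_CR);
    }

    // Unparse direction (hypothetical helper): turn U+E00D read from XML back
    // into a real CR before the value enters the DFDL Infoset.
    static String puaToCr(String xmlText) {
        return xmlText.replace(PUA_CR, CR);
    }

    public static void main(String[] args) {
        String data = "line1\r\nline2\rline3";  // one CRLF pair and one freestanding CR
        String forXml = crToPua(data);
        String roundTrip = puaToCr(forXml);
        System.out.println(forXml.indexOf('\r'));   // prints -1: no CR is left for XML to normalize
        System.out.println(roundTrip.equals(data)); // prints true
    }
}
```

Note that the round trip is exact only if the original data does not itself contain U+E00D; a real implementation would need a policy for that case.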
This behavior should, ultimately, be the default behavior. Converting CR to LF
and CRLF to LF not as delimiters, but in the data contents of an element, is
probably just wrong.
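To illustrate the information loss, here is a small sketch using the JDK's built-in DOM parser (not Daffodil code; the class and helper names are hypothetical). XML 1.0 end-of-line handling makes a conforming parser hand both CRLF and a lone CR to the application as LF, while a PUA code point such as U+E00D passes through unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlLineEndingSketch {
    // Hypothetical helper: parse a one-element document and return its text content.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Literal CR characters in element content are normalized away by the parser.
        String raw = textOf("<v>a\r\nb\rc</v>");
        System.out.println(raw.equals("a\nb\nc"));

        // The same data with CR remapped to U+E00D keeps the distinction.
        String remapped = textOf("<v>a\uE00D\nb\uE00Dc</v>");
        System.out.println(remapped.equals("a\uE00D\nb\uE00Dc"));
    }
}
```

Both comparisons are expected to hold with a conforming XML 1.0 parser: after the first parse there is no way to tell which LFs were originally CR or CRLF, which is the loss this option is meant to avoid.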
The vast majority of Daffodil tests with CRLF-related behaviors use CR, LF,
and CRLF in delimiters. Those would be unaffected by this change.
This change is only about CR or CRLF found in the data values of string
data. So this change will probably not break a large number of tests.
This change should be isolated to some utility functions (in
org.apache.daffodil.util), and to the InfosetInputters and InfosetOutputters
that consume and produce XML, which convert between Daffodil's DFDL Infoset and
the XML Infoset.
Despite the fact that many people use Daffodil to convert data to/from XML, XML
technology is isolated in Daffodil to just the InfosetInputters and
InfosetOutputters.
When parsing, Daffodil converts the data stream into a DFDL Infoset. Then an
InfosetOutputter is called, which traverses the DFDL Infoset creating an XML
Infoset (as text, Scala XML objects, or JDOM objects - take your pick). This
suggested change only affects these InfosetOutputters.
Similarly when unparsing, an InfosetInputter consumes XML and constructs the
DFDL Infoset. This DFDL Infoset is then traversed by the Daffodil unparser to
unparse data back to a data stream. This suggested change affects only these
InfosetInputters.
> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
> Key: DAFFODIL-1559
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
> Project: Daffodil
> Issue Type: New Feature
> Components: API
> Reporter: Steve Lawrence
> Priority: Minor
> Labels: beginner
>
> See the review for more details. The short of it is that when converting parse
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means
> that we lose the information that the data used to contain CRLF. This is
> similar to how we lose that information with delimiters if someone uses NL,
> but it's slightly different since it is actual data. However, it's most user
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this
> information can be maintained if someone needs it.