[jira] [Commented] (DAFFODIL-1559) Add option to disable CRLF to LF XML canonicalization

Michael Beckerle (JIRA) Wed, 10 Apr 2019 04:53:27 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814382#comment-16814382
 ]


Michael Beckerle commented on DAFFODIL-1559:
--------------------------------------------

A freestanding CR also needs to be preservable, not just the CR of a CRLF pair.

Our current policy is described here: 
[https://daffodil.apache.org/infoset/#xml-illegal-characters]

So we need an option to convert all CR to the code point #xE00D whether they 
are isolated CR or part of CRLF pairs.

The code point #xE00D is in the Unicode Private Use Area (PUA), and XML 
processing should then preserve it.
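
As a rough illustration of why the PUA code point survives where a raw CR does
not: XML 1.0 (section 2.11, end-of-line handling) requires parsers to
normalize CRLF and lone CR to LF before the application sees the text, while
#xE00D is an ordinary legal character. The sketch below uses Python's standard
xml.etree.ElementTree purely as a stand-in XML processor. It is not Daffodil
code, and the element name "value" is made up for the example.

```python
import xml.etree.ElementTree as ET

# Element content holding one CRLF pair and one isolated CR (raw characters).
raw_xml = "<value>a\r\nb\rc</value>"

# The same content with every CR replaced by U+E00D beforehand.
pua_xml = "<value>a\uE00D\nb\uE00Dc</value>"

# The parser applies XML end-of-line normalization to the raw CRs,
# but passes the Private Use Area character through untouched.
normalized = ET.fromstring(raw_xml).text
preserved = ET.fromstring(pua_xml).text

print(repr(normalized))
print(repr(preserved))
```

After parsing, the first string no longer distinguishes CRLF, CR, and LF,
whereas the second still carries a marker for every original CR.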

This needs to work for unparsing, with #xE00D turning into #xD (a CR character) 
in the DFDL Infoset which is then unparsed as a regular #xD codepoint.
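
A minimal sketch of the two-way mapping described above, assuming a plain
string replacement is sufficient. The names cr_to_pua and pua_to_cr are
hypothetical and are not Daffodil's actual utility functions. Python is used
only for brevity.

```python
CR = "\r"
PUA_CR = "\uE00D"  # CR (#xD) remapped into the Unicode Private Use Area


def cr_to_pua(text: str) -> str:
    """Outputter direction (hypothetical helper): replace every CR,
    whether isolated or part of a CRLF pair, with U+E00D."""
    return text.replace(CR, PUA_CR)


def pua_to_cr(text: str) -> str:
    """Inputter direction (hypothetical helper): turn U+E00D back into
    a regular CR before the value enters the DFDL Infoset."""
    return text.replace(PUA_CR, CR)


original = "line1\r\nline2\rline3\nline4"
as_xml_text = cr_to_pua(original)
round_tripped = pua_to_cr(as_xml_text)
```

Here as_xml_text contains no CR at all, so XML end-of-line normalization has
nothing to rewrite, and round_tripped is identical to the original string.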

This behavior should, ultimately, be the default behavior. Converting CR to LF 
and CRLF to LF not as delimiters, but in the data contents of an element, is 
probably just wrong.

The vast number of Daffodil tests with CRLF-related behaviors use CR, LF, and 
CRLF in delimiters. Those would be unaffected by this change, which applies 
only when CR or CRLF is found in the values of string data. So perhaps few 
tests will be broken by this change.

This change should be isolated to some utility functions (in 
org.apache.daffodil.util), and to the InfosetInputters and InfosetOutputters 
that consume and produce XML, which convert between Daffodil's DFDL Infoset 
and the XML Infoset.

Despite the fact that many people use Daffodil to convert data to/from XML, XML 
technology is isolated in Daffodil to just the infoset inputter and infoset 
outputters.

When parsing, Daffodil converts the data stream into a DFDL Infoset. Then an 
InfosetOutputter is called which traverses the DFDL Infoset, creating an XML 
Infoset (as text, Scala XML objects, or JDOM objects - take your pick). This 
suggested change only affects these InfosetOutputters.

Similarly when unparsing, an InfosetInputter consumes XML and constructs the 
DFDL Infoset. This DFDL Infoset is then traversed by the Daffodil unparser to 
unparse data back to a data stream. This suggested change affects only these 
InfosetInputters.

 

> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
>                 Key: DAFFODIL-1559
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
>             Project: Daffodil
>          Issue Type: New Feature
>          Components: API
>            Reporter: Steve Lawrence
>            Priority: Minor
>              Labels: beginner
>
> See the review for more details. The short of it is that when converting parse 
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means 
> that we lose the information that the data used to contain CRLF. This is 
> similar to how we lose that information with delimiters if someone uses NL, 
> but it's slightly different since it is actual data. However, it's most user 
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this 
> information can be maintained if someone needs it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
