[jira] [Commented] (DAFFODIL-1559) Add option to disable CRLF to LF XML canonicalization

Mike Beckerle (Jira) Tue, 03 Jun 2025 13:08:04 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955975#comment-17955975
 ]


Mike Beckerle commented on DAFFODIL-1559:
-----------------------------------------

The issue of CRLF handling being lossy came up again recently.

I'd like to suggest that we add a DFDL property to help here, rather than some 
way to parameterize infoset outputters/inputters. 

Note that DFDL allows us to specify custom character set encodings. So we could 
define X-DAFFODIL-ASCII-JSON to mean the same thing as ascii, but with C0 
controls turned into JSON-style escapes (along with backslash being doubled).

Given that we can do that, it's fair game to say "properties can control the 
way strings convert into the infoset".

So we provide an extension property such as dfdlx:infosetStringRemap which 
identifies how to insert escapes/replacement characters into strings of the 
infoset. 

So dfdl:encoding="ascii" dfdlx:infosetStringRemap="json" would be the same 
thing as the X-DAFFODIL-ASCII-JSON encoding I just described. A variation, 
"jsonExceptLF", would be the same except that it leaves LF alone.

There really is no reason not to allow a DFDL property to control this 
behavior. 
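
To make the "json" and "jsonExceptLF" remaps described above concrete, here is 
a minimal Python sketch of what they might do to a string. The function name, 
the table of short escapes, and the use of \uXXXX for the remaining C0 
controls are assumptions for illustration only; none of this is existing 
Daffodil code.

```python
# Hypothetical sketch of the proposed "json" / "jsonExceptLF" remaps.
# Short JSON escapes for the C0 controls that have one (assumed choice).
_SHORT_ESCAPES = {"\b": "\\b", "\t": "\\t", "\n": "\\n", "\f": "\\f", "\r": "\\r"}


def infoset_string_remap(s: str, mode: str = "json") -> str:
    """Escape C0 controls JSON-style and double every backslash."""
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")  # doubling keeps the mapping reversible
        elif ch == "\n" and mode == "jsonExceptLF":
            out.append(ch)  # leave LF alone for readability
        elif ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            out.append("\\u%04x" % ord(ch))  # remaining C0 controls
        else:
            out.append(ch)
    return "".join(out)
```

Because backslash is doubled, a CR in the data and the two characters 
backslash and "r" in the data produce different output, so nothing is lost.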

This has the potential to be more efficient, as it can be done as the string is 
parsed/unparsed, rather than as yet another operation on the string value after 
the infoset has been created. 

Each of the suggestions in this thread that we bother to implement could have 
its own name.

I think the current scheme (the default) would be named 
"Xml1.0IllegalRemapDropCR" or something else that makes it clear the CRs are 
going to be dropped, except now they would be dropped before the 
InfosetOutputter processes them.
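
As a rough illustration of why a name with "DropCR" in it fits, here is a 
Python sketch of the current behavior as the issue description states it (CR 
and CRLF both become LF). The function name is hypothetical, and the 0xE000 
PUA offset for illegal XML 1.0 characters is an assumption; only C0 controls 
are handled here.

```python
# Hypothetical sketch of the current default, not Daffodil's actual code.
PUA_OFFSET = 0xE000  # assumption: illegal characters are shifted into the PUA


def xml10_illegal_remap_drop_cr(s: str) -> str:
    """Canonicalize line endings to LF, then remap illegal C0 controls."""
    # CRLF must be replaced before lone CR so it does not become two LFs.
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for ch in s:
        if ord(ch) < 0x20 and ch not in "\t\n":
            out.append(chr(ord(ch) + PUA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

Under this sketch "a\r\nb", "a\rb" and "a\nb" all map to the same result, 
which is exactly the loss of information this issue is about.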

There are implications here. If the DFDL schema has facet patterns, those 
regexes would be processing infoset strings that already have the escaping 
applied, which is different from now, where those patterns operate on the 
idealized DFDL infoset string.
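
A small Python sketch of that implication: the same pattern can accept the 
idealized string and reject its escaped form. The pattern and strings are made 
up for illustration, and Python's re module only approximates the XSD regex 
dialect used by facet patterns.

```python
import re

# Hypothetical facet pattern: two words separated by a CRLF.
pattern = re.compile(r"\w+\r\n\w+")

idealized = "line1\r\nline2"    # string as the pattern sees it today
escaped = "line1\\r\\nline2"    # same string after a JSON-style remap

matches_idealized = pattern.fullmatch(idealized) is not None
matches_escaped = pattern.fullmatch(escaped) is not None
```

Here matches_idealized is True and matches_escaped is False, so schema authors 
would have to write patterns against the escaped form.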

I think in Cyberia (the cyber security application area) the need is for 
options that let you achieve each of these requirements (not simultaneously):
 1. not lossy
 2. number of infoset characters matches the number in the original data
 3. canonicalizes line endings

Not all of those are necessarily satisfied at the same time. JSON styles would 
satisfy (1) and be mostly readable; jsonExceptLF would improve readability. 
Things like replacing CRLF by U+202B and LF, and replacing isolated CR by 
U+2028, plus PUA remapping, would allow achieving (2). Choice (3) can be 
achieved by ordinary DFDL, where you parse the string into an array of 
strings delimited by all the various kinds of line endings.
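
For the character-count-preserving idea, a minimal Python sketch follows. It 
uses the code points exactly as named above and treats them as placeholders 
for whatever replacements are finally chosen; the function name is 
hypothetical, and PUA remapping of other illegal characters is left out.

```python
# Hypothetical sketch: replace each CR one-for-one so the length is unchanged.
CR_OF_CRLF = "\u202b"   # stands in for the CR of a CRLF pair; the LF is kept
ISOLATED_CR = "\u2028"  # stands in for a CR not followed by LF


def remap_cr_preserving_length(s: str) -> str:
    out = []
    for i, ch in enumerate(s):
        if ch != "\r":
            out.append(ch)
        elif s[i + 1 : i + 2] == "\n":  # slice is safe at end of string
            out.append(CR_OF_CRLF)
        else:
            out.append(ISOLATED_CR)
    return "".join(out)
```

Each CR is replaced by exactly one character, so the output always has the 
same number of characters as the input. It is only reversible if the 
replacement characters do not otherwise occur in the data.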

> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
>                 Key: DAFFODIL-1559
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: API
>            Reporter: Steve Lawrence
>            Priority: Major
>              Labels: beginner
>             Fix For: 4.0.0
>
>
> See the review for more details. The short of it is that when converting parse 
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means 
> that we lose the information that the data used to contain CRLF. This is 
> similar to how we lose that information with delimiters if someone uses NL, 
> but it's slightly different since it is actual data. However, it's most user 
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this 
> information can be maintained if someone needs it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
