[ https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955975#comment-17955975 ]
Mike Beckerle commented on DAFFODIL-1559:
-----------------------------------------

This CRLF being lossy issue came up again recently.

I'd like to suggest that we add a DFDL property to help here, rather than some way to parameterize infoset outputters/inputters.

Note that DFDL allows us to specify custom character set encodings. So we could define X-DAFFODIL-ASCII-JSON to mean the same thing as ascii, but with C0 controls turned into JSON-style escapes (along with backslash being doubled).

Given that we can do that, it's fair game to say "properties can control the way strings convert into the infoset".

So we provide an extension property such as dfdlx:infosetStringRemap which identifies how to insert escapes/replacement characters into strings of the infoset.

So dfdl:encoding="ascii" dfdlx:infosetStringRemap="json" would be the same thing as what I just described as the X-DAFFODIL-ASCII-JSON encoding. A variation "jsonExceptLF" would be the same except leaving LF alone.

There really is no reason not to allow a DFDL property to control this behavior. This has the potential to be more efficient, as it can be done as the string is parsed/unparsed, rather than as yet another operation on the string value after the infoset has been created.

The various suggestions provided in this thread (the ones we bother to implement) could each have a name.

I think the current scheme (the default) would be named "Xml1.0IllegalRemapDropCR" or something else that makes it clear the CRs are going to be dropped, except now they would be dropped before the InfosetOutputter processes them.

There are implications here. If the DFDL schema has facet patterns, those regexes would be processing the infoset string which would already have the escaping applied to it, and that's different from now, where those patterns operate on the idealized DFDL infoset string.
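
As a rough illustration of the "json" and "jsonExceptLF" remappings suggested above, here is a minimal Python sketch. The function names remap_json and unremap_json are hypothetical and are not part of Daffodil or of any existing DFDL property; the sketch assumes that only backslash and the C0 controls (U+0000 to U+001F) are escaped, as described in the comment, so double quotes are left alone.

```python
import re

# Short JSON escapes for the C0 controls that have one (assumed mapping).
_SHORT = {"\b": "\\b", "\t": "\\t", "\n": "\\n", "\f": "\\f", "\r": "\\r"}
_SHORT_INV = {v: k for k, v in _SHORT.items()}
_SHORT_INV["\\\\"] = "\\"

# Matches one escape token: \uXXXX, or a backslash followed by \, b, t, n, f, r.
_ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}|\\[\\btnfr]")


def remap_json(s: str, keep_lf: bool = False) -> str:
    """Parse direction: double backslashes and escape C0 controls.

    keep_lf=True models the hypothetical "jsonExceptLF" variant.
    """
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif keep_lf and ch == "\n":
            out.append(ch)  # leave LF readable
        elif ch in _SHORT:
            out.append(_SHORT[ch])
        elif ord(ch) < 0x20:
            out.append(f"\\u{ord(ch):04x}")  # remaining C0 controls
        else:
            out.append(ch)
    return "".join(out)


def unremap_json(s: str) -> str:
    """Unparse direction: turn the escapes back into the original characters."""
    def repl(m: re.Match) -> str:
        tok = m.group(0)
        if tok.startswith("\\u"):
            return chr(int(tok[2:], 16))
        return _SHORT_INV[tok]
    return _ESCAPE_RE.sub(repl, s)
```

Because CR and LF are escaped separately, a CRLF in the data survives a round trip through unremap_json(remap_json(s)), which is the "not lossy" property this issue is after. Note that the remapped string is generally longer than the original data.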

I think in Cyberia (Cyber security application area) the requirements are to have options allowing you to achieve these requirements (not simultaneously):
# not lossy
# number of infoset characters matches number in original data
# canonicalizes line endings

Not all of those are necessarily satisfied at the same time.

JSON styles would satisfy (1) and be mostly readable; jsonExceptLF would improve readability.

Things like replacing CRLF by U+202B and LF, and replacing isolated CR by U+2028, plus PUA remapping, would allow achieving (2).

Choice (3) can be achieved by ordinary DFDL, where you parse the string into an array of strings delimited by all the various kinds of line endings.

> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
>                 Key: DAFFODIL-1559
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: API
>            Reporter: Steve Lawrence
>            Priority: Major
>              Labels: beginner
>             Fix For: 4.0.0
>
>
> See the review for more details. The short of it is that when converting parse
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means
> that we lose the information that the data used to contain CRLF. This is
> similar to how we lose that information with delimiters if someone uses NL,
> but it's slightly different since it is actual data. However, it's most user
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this
> information can be maintained if someone needs it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)