[
https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848244#comment-17848244
]
Mike Beckerle commented on DAFFODIL-1559:
-----------------------------------------
A thought. We could use somewhat more elaborate mappings.
Since XML doesn't preserve CR, we have to remap it to something unused that is
still a legal character. The Private Use Area (PUA) is one idea, but when
people see their CRLF-line-ending data as XML they will see goofy characters in
it corresponding to the CR character mapped to U+E00D, assuming they have a
Unicode font in use and it has a glyph for that code point.
How about instead mapping CRLF to a U+200B followed by a LF? The U+200B is a
zero-width space: it indicates a word boundary but occupies no space.
This map could then be inverted to convert the pair U+200B, LF back into a
CRLF.
An isolated CR can be converted into a Unicode NEL (U+0085) or U+2028 (LINE
SEPARATOR), and converted back when unparsing.
This has the advantage that when you look at the XML data it will look normal.
It will have line endings where you expect them. As with the PUA scheme, the
length in characters (but not in bytes) will be the same.
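
A minimal sketch of the CR/CRLF mapping described above, in Python rather than
Daffodil's own code. The function names are hypothetical, U+2028 is picked over
NEL for the isolated CR purely for illustration, and the input is assumed to
contain no U+200B or U+2028 of its own.

```python
ZWSP = "\u200b"      # ZERO WIDTH SPACE, stands in for the CR of a CRLF pair
LINE_SEP = "\u2028"  # LINE SEPARATOR, stands in for an isolated CR


def cr_to_xml_safe(text: str) -> str:
    """Replace CRLF with U+200B + LF, then any remaining CR with U+2028."""
    # CRLF must be handled first so that its CR is not treated as isolated.
    return text.replace("\r\n", ZWSP + "\n").replace("\r", LINE_SEP)


def xml_safe_to_cr(text: str) -> str:
    """Invert cr_to_xml_safe (assumes no U+200B/U+2028 in the original data)."""
    return text.replace(ZWSP + "\n", "\r\n").replace(LINE_SEP, "\r")


original = "line one\r\nline two\rline three\n"
mapped = cr_to_xml_safe(original)

assert "\r" not in mapped                  # nothing left for XML to normalize
assert len(mapped) == len(original)        # same length in characters
assert xml_safe_to_cr(mapped) == original  # the mapping inverts cleanly
```

Both directions are one-for-one character substitutions, which is why the
character count is preserved.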
For any character set that cannot represent U+200B and U+2028 (so all
single-byte charsets), this mapping is safe and invertible.
For UTF-8 or another Unicode charset, these code points could theoretically
appear in the data. In that case we could remap those characters, which are
very rare, to/from the PUA, e.g. by adding 0xC100 to their code point (giving
U+E10B and U+E128).
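
The PUA escape for those rare characters could look something like the sketch
below. It is Python for illustration only; the names are hypothetical, and the
0xC100 offset is just the example value suggested above.

```python
PUA_OFFSET = 0xC100          # example offset from the comment above
RESERVED = (0x200B, 0x2028)  # code points the CR/CRLF mapping wants to use

# Translation tables for str.translate: code point -> code point.
TO_PUA = {cp: cp + PUA_OFFSET for cp in RESERVED}
FROM_PUA = {pua: cp for cp, pua in TO_PUA.items()}


def escape_reserved(text: str) -> str:
    """Move U+200B and U+2028 that occur in the data into the PUA."""
    return text.translate(TO_PUA)


def unescape_reserved(text: str) -> str:
    """Move the escaped characters back out of the PUA."""
    return text.translate(FROM_PUA)


data = "a\u200bb\u2028c"
escaped = escape_reserved(data)

assert escaped == "a\ue10bb\ue128c"
assert unescape_reserved(escaped) == data
```

Ordering matters if this is combined with the CR/CRLF mapping: escape before
mapping CR when producing XML, and undo the CR mapping before unescaping when
unparsing, so that a restored U+200B is never mistaken for a mapped CR.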
This lets us perfectly round-trip CR and CRLF. The XML data will at least look
normal to users and have the same number of characters as were in the DFDL
infoset, and it avoids using the PUA for CR, which is ugly and unexpected by
users.
Now here's an even more aggressive mapping idea that builds on that one:
Map CR and CRLF as above. Map all the other C0 control characters (excluding
tab and LF) to their "control picture" characters, which are Unicode U+2400 to
U+241F. Map DEL (U+007F) to U+2421, the control picture for DEL.
If the user has a font that can display these characters, they will be
represented by their legible pictures.
If these characters appear in the source data, those characters can be remapped
to part of the PUA.
So the only problem is if the incoming data contains PUA characters. We would
not have a place to remap PUA characters, but this is not a new drawback, our
existing PUA mapping has this problem.
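
The control-picture mapping could be sketched as follows (Python, hypothetical
names). CR is left out of the table on the assumption that it is handled by
the CR/CRLF mapping, and the PUA remap for control pictures that already occur
in the data is omitted, since no offset for it is proposed here.

```python
# Tab and LF stay as they are; CR is assumed to be handled separately.
EXCLUDED = {0x09, 0x0A, 0x0D}

# C0 control U+00NN -> control picture U+24NN; DEL (U+007F) -> U+2421.
TO_PICTURE = {cp: 0x2400 + cp for cp in range(0x20) if cp not in EXCLUDED}
TO_PICTURE[0x7F] = 0x2421
FROM_PICTURE = {pic: cp for cp, pic in TO_PICTURE.items()}


def to_control_pictures(text: str) -> str:
    """Replace C0 controls and DEL with their visible control pictures."""
    return text.translate(TO_PICTURE)


def from_control_pictures(text: str) -> str:
    """Restore the original control characters from their pictures."""
    return text.translate(FROM_PICTURE)


raw = "\x00\x1b\x7f\tA\n"
shown = to_control_pictures(raw)

assert shown == "\u2400\u241b\u2421\tA\n"
assert from_control_pictures(shown) == raw
```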
Here's a string of all the C0 control pictures:
"␀␁␂␃␄␅␆␇␈␉␊␋␌␍␎␏␐␑␒␓␔␕␖␗␘␙␚␛␜␝␞␟ ␡" These are for U+00 to U+1F, then a space,
then the picture for DEL (U+7F).
They seem to display fine in this browser.
> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
> Key: DAFFODIL-1559
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
> Project: Daffodil
> Issue Type: Improvement
> Components: API
> Reporter: Steve Lawrence
> Priority: Major
> Labels: beginner
> Fix For: 4.0.0
>
>
> See the review for more details. The short of it is that when converting parse
> results to XML, we convert CR to LF, and we convert CRLF to LF. This means
> that we lose the information that the data used to contain CRLF. This is
> similar to how we lose that information with delimiters if someone uses NL,
> but it's slightly different since it is actual data. However, it's most user
> friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this
> information can be maintained if someone needs it.