mbeckerle commented on code in PR #1247:
URL: https://github.com/apache/daffodil/pull/1247#discussion_r1619061081
##########
daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala:
##########
@@ -153,11 +153,20 @@ object XMLUtils {
list
}
- // FIXME: DAFFODIL-2883 - this needs checkForExistingPUA to be false so that
data
- // which contains unicode PUA characters doesn't cause an SDE. Needs to be
either
- // accepted or optionally cause a ParseError.
private val remapXMLToPUA =
- new RemapXMLIllegalCharToPUA(checkForExistingPUA = true, replaceCRWithLF =
true)
+ new RemapXMLIllegalCharToPUA(
+ // To fix DAFFODIL-2883 - changed to tolerate existing PUA by default.
It just
+ // ignores them. If they happen to be ones we use like U+E000 for NUL,
then
+ // a round trip of the data will not preserve them.
+ //
+ // Note that fuzz testing that just permutes bytes along with unicode
charset data
+ // can run into this fairly easily. Since this remap is called in the
+ // InfosetOutputter converting the DFDL infoset to XML, you cannot, at
that point,
+ // fail in any useful way.
+ //
+ checkForExistingPUA = false,
+ replaceCRWithLF = true
+ )
def remapXMLIllegalCharactersToPUA(s: String): String =
remapXMLToPUA.remap(s)
Review Comment:
Yeah, the DFDL Infoset definitely allows all PUA characters. So DFDL Infoset
to XML Infoset projection should not be causing errors on any unicode chars.
Whether we allow existing PUA chars at parse time... it's an XML-specific
parse-time option. We have no feature to declare them illegal, and if we did it
ought to depend on dfdl:encodingErrorPolicy whether they get replaced or a
parse error is signaled.
Right now, suppose you had data in UTF-8 that has existing PUA characters in
it that happen to overlap with those remapped by Daffodil. E.g, U+E000 which
we use for NUL. Well your data cannot be faithfully converted to/from XML by
Daffodil, simple as that. We will turn your E000 into a NUL in the infoset on
unparsing.
This is not a Daffodil issue. It is simply not possible to convert data into
XML if it uses every unicode code point, since we need some to represent the
XML-illegal characters that are allowed in the DFDL Infoset.
The workaround is just say no to XML: convert to JSON first, then remap all
your PUA chars that collide with the Daffodil remappings, converting those to
some part of the PUA that neither your data, nor Daffodil, use. Then unparse.
Now you have data that can be faithfully converted to XML and back by Daffodil.
However, to unparse back to *exactly* what you started from, you have to invert
this JSON operation as well.
To me, this is sufficient as a work-around for this very obscure problem.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]