mbeckerle commented on code in PR #1247:
URL: https://github.com/apache/daffodil/pull/1247#discussion_r1619061081


##########
daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala:
##########
@@ -153,11 +153,20 @@ object XMLUtils {
     list
   }
 
-  // FIXME: DAFFODIL-2883 - this needs checkForExistingPUA to be false so that 
data
-  //  which contains unicode PUA characters doesn't cause an SDE. Needs to be 
either
-  //  accepted or optionally cause a ParseError.
   private val remapXMLToPUA =
-    new RemapXMLIllegalCharToPUA(checkForExistingPUA = true, replaceCRWithLF = 
true)
+    new RemapXMLIllegalCharToPUA(
+      // To fix DAFFODIL-2883 - changed to tolerate existing PUA by default. 
It just
+      // ignores them. If they happen to be ones we use like U+E000 for NUL, 
then
+      // a round trip of the data will not preserve them.
+      //
+      // Note that fuzz testing that just permutes bytes along with unicode 
charset data
+      // can run into this fairly easily. Since this remap is called in the
+      // InfosetOutputter converting the DFDL infoset to XML, you cannot, at 
that point,
+      // fail in any useful way.
+      //
+      checkForExistingPUA = false,
+      replaceCRWithLF = true
+    )
 
   def remapXMLIllegalCharactersToPUA(s: String): String = 
remapXMLToPUA.remap(s)

Review Comment:
   Yeah, the DFDL Infoset definitely allows all PUA characters. So DFDL Infoset 
to XML Infoset projection should not be causing errors on any unicode chars.
   
   Whether we allow existing PUA chars at parse time... it's an XML-specific 
parse-time option. We have no feature to declare them illegal, and if we did it 
ought to depend on dfdl:encodingErrorPolicy whether they get replaced or a 
parse error is signaled. 
   
   Right now, suppose you had data in UTF-8 that has existing PUA characters in 
it that happen to overlap with  those remapped by Daffodil. E.g, U+E000 which 
we use for NUL. Well your data cannot be faithfully converted to/from XML by 
Daffodil, simple as that. We will turn your E000 into a NUL in the infoset on 
unparsing. 
   
   This is not a Daffodil issue. It is simply not possible to convert data into 
XML if it uses every unicode code point, since we need some to represent the 
XML-illegal characters that are allowed in the DFDL Infoset. 
   
   The workaround is just say no to XML: convert to JSON first, then remap all 
your PUA chars that collide with the Daffodil remappings, converting those to 
some part of the PUA that neither your data, nor Daffodil, use. Then unparse. 
Now you have data that can be faithfully converted to XML and back by Daffodil. 
However, to unparse back to *exactly* what you started from, you have to invert 
this JSON operation as well. 
   
   To me, this is sufficient as a work-around for this very obscure problem. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to