This is an automated email from the ASF dual-hosted git repository.

mbeckerle pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/daffodil.git


The following commit(s) were added to refs/heads/main by this push:
     new 8433f4d83 Remapper from XMLIllegal to PUA ignores pre-existing PUA characters in the input data.
8433f4d83 is described below

commit 8433f4d83d33da31371fcdfa016a7c627011aca3
Author: Michael Beckerle <[email protected]>
AuthorDate: Tue May 28 17:50:38 2024 -0400

    Remapper from XMLIllegal to PUA ignores pre-existing PUA characters in the input data.
    
    Existing PUA chars were producing SDE in the InfosetOutputter to XML. This is too late for a
    parser to backtrack because of these characters.
    
    Turns out if you do fuzz testing on data formats that use unicode charsets (like zip
    file format), then it's very easy to end up with some PUA characters in the data.
    
    This fix is needed for Daffodil 3.8.0 because layer algorithms for things like zip files were
    hitting this with their fuzz testing.
    
    DAFFODIL-2883
---
 .../scala/org/apache/daffodil/lib/xml/XMLUtils.scala    | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala b/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
index d56415fa4..dc41c6504 100644
--- a/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
+++ b/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
@@ -153,11 +153,20 @@ object XMLUtils {
     list
   }
 
-  // FIXME: DAFFODIL-2883 - this needs checkForExistingPUA to be false so that data
-  //  which contains unicode PUA characters doesn't cause an SDE. Needs to be either
-  //  accepted or optionally cause a ParseError.
   private val remapXMLToPUA =
-    new RemapXMLIllegalCharToPUA(checkForExistingPUA = true, replaceCRWithLF = true)
+    new RemapXMLIllegalCharToPUA(
+      // To fix DAFFODIL-2883 - changed to tolerate existing PUA by default. It just
+      // ignores them. If they happen to be ones we use like U+E000 for NUL, then
+      // a round trip of the data will not preserve them.
+      //
+      // Note that fuzz testing that just permutes bytes along with unicode charset data
+      // can run into this fairly easily. Since this remap is called in the
+      // InfosetOutputter converting the DFDL infoset to XML, you cannot, at that point,
+      // fail in any useful way.
+      //
+      checkForExistingPUA = false,
+      replaceCRWithLF = true
+    )
 
   def remapXMLIllegalCharactersToPUA(s: String): String = remapXMLToPUA.remap(s)
 

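The behavior change described in the commit can be illustrated with a minimal sketch. This is not Daffodil's `RemapXMLIllegalCharToPUA`; the object `PuaRemapSketch`, its `remap` signature, the exception type, and the reduced set of "illegal" characters (C0 controls only) are all assumptions made for illustration. Only the two flag names and the U+E000-for-NUL mapping come from the diff above.

```scala
// Hypothetical, simplified stand-in for the remapper; NOT the Daffodil implementation.
object PuaRemapSketch {
  // Basic Multilingual Plane Private Use Area range.
  private val PuaStart = 0xe000
  private val PuaEnd = 0xf8ff

  // Assumption: only C0 controls other than TAB, LF, and CR are treated as
  // XML-illegal here. The real remapper covers more cases.
  private def isXmlIllegal(c: Char): Boolean =
    c < 0x20 && c != '\t' && c != '\n' && c != '\r'

  def remap(s: String, checkForExistingPUA: Boolean, replaceCRWithLF: Boolean): String =
    s.map { c =>
      if (c >= PuaStart && c <= PuaEnd) {
        // Old behavior (flag = true): reject data that already contains PUA characters.
        // New behavior (flag = false): pass them through untouched.
        if (checkForExistingPUA)
          throw new IllegalArgumentException(f"Pre-existing PUA character U+${c.toInt}%04X")
        else c
      } else if (c == '\r' && replaceCRWithLF) '\n' // simplification: CRLF pairs are not collapsed
      else if (isXmlIllegal(c)) (c + PuaStart).toChar // e.g. NUL (U+0000) becomes U+E000
      else c
    }
}
```

With `checkForExistingPUA = false`, an input that already contains U+E000 yields the same output as an input containing NUL, which is why the code comment in the diff warns that a round trip will not preserve such characters.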