This is an automated email from the ASF dual-hosted git repository.
mbeckerle pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/daffodil.git
The following commit(s) were added to refs/heads/main by this push:
new 8433f4d83 Remapper from XMLIllegal to PUA ignores pre-existing PUA characters in the input data.
8433f4d83 is described below
commit 8433f4d83d33da31371fcdfa016a7c627011aca3
Author: Michael Beckerle <[email protected]>
AuthorDate: Tue May 28 17:50:38 2024 -0400
Remapper from XMLIllegal to PUA ignores pre-existing PUA characters in the input data.
Existing PUA chars were producing a Schema Definition Error (SDE) in the
InfosetOutputter to XML. That is too late for a parser to backtrack because
of these characters.
It turns out that if you do fuzz testing on data formats that use Unicode
charsets (like the zip file format), it is very easy to end up with some PUA
characters in the data.
This fix is needed for Daffodil 3.8.0 because layer algorithms for things
like zip files were hitting this in their fuzz testing.
DAFFODIL-2883
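For illustration only, here is a minimal sketch of the behavior change described above. It is not Daffodil's RemapXMLIllegalCharToPUA: the object name PuaRemapSketch and its remap signature are hypothetical, and it assumes a simplified mapping in which only C0 control characters are shifted into the PUA by adding 0xE000 (consistent with the U+E000-for-NUL example in this commit).

```scala
// Hypothetical, simplified stand-in for RemapXMLIllegalCharToPUA.
// Assumption: only C0 controls are remapped, by adding 0xE000.
object PuaRemapSketch {
  private val PuaBase = 0xE000

  // C0 controls other than TAB, LF, and CR are not legal in XML 1.0.
  private def isXmlIllegalControl(c: Char): Boolean =
    c < 0x20 && c != '\t' && c != '\n' && c != '\r'

  // Basic Multilingual Plane Private Use Area: U+E000..U+F8FF.
  private def isPUA(c: Char): Boolean = c >= 0xE000 && c <= 0xF8FF

  def remap(s: String, checkForExistingPUA: Boolean, replaceCRWithLF: Boolean): String =
    s.map { c =>
      if (checkForExistingPUA && isPUA(c))
        // Old behavior: pre-existing PUA characters are treated as an error.
        throw new IllegalArgumentException(f"Pre-existing PUA character U+${c.toInt}%04X")
      else if (c == '\r' && replaceCRWithLF) '\n'
      else if (isXmlIllegalControl(c)) (c + PuaBase).toChar
      else c // New behavior: existing PUA characters pass through untouched.
    }
}
```

With checkForExistingPUA = false, an input that already contains U+E000 passes through unchanged, so the output cannot distinguish it from a remapped NUL. That is the round-trip caveat this commit accepts in exchange for not failing in the InfosetOutputter.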
---
.../scala/org/apache/daffodil/lib/xml/XMLUtils.scala | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala b/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
index d56415fa4..dc41c6504 100644
--- a/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
+++ b/daffodil-lib/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
@@ -153,11 +153,20 @@ object XMLUtils {
list
}
- // FIXME: DAFFODIL-2883 - this needs checkForExistingPUA to be false so that data
- // which contains unicode PUA characters doesn't cause an SDE. Needs to be either
- // accepted or optionally cause a ParseError.
private val remapXMLToPUA =
- new RemapXMLIllegalCharToPUA(checkForExistingPUA = true, replaceCRWithLF = true)
+ new RemapXMLIllegalCharToPUA(
+ // To fix DAFFODIL-2883 - changed to tolerate existing PUA by default. It just
+ // ignores them. If they happen to be ones we use like U+E000 for NUL, then
+ // a round trip of the data will not preserve them.
+ //
+ // Note that fuzz testing that just permutes bytes along with unicode charset data
+ // can run into this fairly easily. Since this remap is called in the
+ // InfosetOutputter converting the DFDL infoset to XML, you cannot, at that point,
+ // fail in any useful way.
+ //
+ checkForExistingPUA = false,
+ replaceCRWithLF = true
+ )
def remapXMLIllegalCharactersToPUA(s: String): String =
remapXMLToPUA.remap(s)