Copilot commented on code in PR #4079:
URL: https://github.com/apache/logging-log4j2/pull/4079#discussion_r2983763152


##########
log4j-api-test/src/test/java/org/apache/logging/log4j/util/StringBuildersTest.java:
##########
@@ -79,15 +84,37 @@ void escapeJsonCharactersISOControl() {
         assertEquals(jsonValueEscaped, sb.toString());
     }
 
-    @Test
-    void escapeXMLCharactersCorrectly() {
-        final String xmlValueNotEscaped = "<\"Salt&Peppa'\">";
-        final String xmlValueEscaped = 
"&lt;&quot;Salt&amp;Peppa&apos;&quot;&gt;";
+    static Stream<Arguments> escapeXmlCharactersCorrectly() {
+        final char replacement = '\uFFFD';
+        return Stream.of(
+                // Empty
+                Arguments.of("", ""),
+                // characters that need to be escaped
+                Arguments.of("<\"Salt&Peppa'\">", 
"&lt;&quot;Salt&amp;Peppa&apos;&quot;&gt;"),
+                // control character replaced with U+FFFD
+                Arguments.of("A" + (char) 0x01 + "B", "A" + replacement + "B"),
+                // standalone low surrogate replaced with U+FFFD
+                Arguments.of("low" + Character.MIN_SURROGATE + "surrogate", 
"low" + replacement + "surrogate"),
+                Arguments.of(Character.MIN_SURROGATE + "low", replacement + 
"low"),
+                // standalone high surrogate replaced with U+FFFD
+                Arguments.of("high" + Character.MAX_SURROGATE + "surrogate", 
"high" + replacement + "surrogate"),
+                Arguments.of(Character.MAX_SURROGATE + "high", replacement + 
"high"),

Review Comment:
   The surrogate tests are using `Character.MIN_SURROGATE`/`MAX_SURROGATE`, but 
the comments label them as “low surrogate” and “high surrogate” respectively. 
`MIN_SURROGATE` is the minimum *high* surrogate (0xD800) and `MAX_SURROGATE` is 
the maximum *low* surrogate (0xDFFF), so these cases don’t actually exercise 
the intended branches (especially the invalid low-surrogate handling). Consider 
switching to `Character.MIN_LOW_SURROGATE`/`MAX_LOW_SURROGATE` and 
`Character.MIN_HIGH_SURROGATE`/`MAX_HIGH_SURROGATE` (and/or adjust the 
comments) so the tests match what they claim to cover.
   ```suggestion
                   Arguments.of("low" + Character.MIN_LOW_SURROGATE + 
"surrogate", "low" + replacement + "surrogate"),
                   Arguments.of(Character.MIN_LOW_SURROGATE + "low", 
replacement + "low"),
                   // standalone high surrogate replaced with U+FFFD
                   Arguments.of("high" + Character.MAX_HIGH_SURROGATE + 
"surrogate", "high" + replacement + "surrogate"),
                   Arguments.of(Character.MAX_HIGH_SURROGATE + "high", 
replacement + "high"),
   ```



##########
log4j-api/src/main/java/org/apache/logging/log4j/util/StringBuilders.java:
##########
@@ -369,8 +392,32 @@ public static void escapeXml(final StringBuilder 
toAppendTo, final int start) {
                     toAppendTo.setCharAt(lastPos--, '&');
                     break;
                 default:
-                    toAppendTo.setCharAt(lastPos--, c);
+                    toAppendTo.setCharAt(lastPos--, isValidXml10(c) ? c : 
REPLACEMENT_CHAR);
             }
         }
     }
+
+    /**
+     * Checks if a code point is a valid XML 1.0 character
+     *
+     * <p>This method is restricted to characters in the BMP, i.e. represented 
by one UTF-16 code unit.</p>
+     *
+     * @param codePoint a code point
+     * @return {@code true} if it is a valid XML 1.0 code point
+     */
+    private static boolean isValidXml10(final char codePoint) {
+        // XML 1.0 valid characters (Fifth Edition):
+        //   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]
+
+        // [#x20–#xD7FF] (placed early as a fast path for the most common case)
+        return (codePoint >= ' ' && codePoint < Character.MIN_SURROGATE)
+                // #x9
+                || codePoint == '\t'
+                // #xA
+                || codePoint == '\n'
+                // #xD
+                || codePoint == '\r'
+                // [#xE000-#xFFFD]
+                || (codePoint > Character.MAX_SURROGATE && codePoint <= 
0xFFFD);

Review Comment:
   `isValidXml10` is documented as operating on a “code point”, but the 
signature takes a `char` (BMP-only). To avoid confusion, consider renaming the 
parameter to something like `ch`/`c` and updating the Javadoc to explicitly say 
“BMP char” rather than “code point”.
   ```suggestion
        * Checks if a BMP {@code char} is a valid XML 1.0 character.
        *
        * <p>This method is restricted to characters in the BMP, i.e. 
represented by one UTF-16 code unit.</p>
        *
        * @param ch a BMP {@code char} to validate
        * @return {@code true} if it is a valid XML 1.0 character
        */
       private static boolean isValidXml10(final char ch) {
           // XML 1.0 valid characters (Fifth Edition):
           //   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]
   
           // [#x20–#xD7FF] (placed early as a fast path for the most common 
case)
           return (ch >= ' ' && ch < Character.MIN_SURROGATE)
                   // #x9
                   || ch == '\t'
                   // #xA
                   || ch == '\n'
                   // #xD
                   || ch == '\r'
                   // [#xE000-#xFFFD]
                   || (ch > Character.MAX_SURROGATE && ch <= 0xFFFD);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to