Re: [PR] [WIP][SQL] Invalid UTF-8 byte sequence replacement [spark]

via GitHub Fri, 07 Jun 2024 05:35:30 -0700


uros-db commented on code in PR #46899:
URL: https://github.com/apache/spark/pull/46899#discussion_r1631136369



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -270,6 +279,123 @@ public byte[] getBytes() {
     }
   }
 
+  /**
+   * Utility methods and constants for UTF-8 string validation.
+   */
+
+  private static boolean isValidContinuationByte(byte b) {
+     return (byte) 0x80 <= b && b <= (byte) 0xBF;
+  }
+
+  private static boolean isValidSecondByte(byte b, byte firstByte) {
+    return switch (firstByte) {
+      case (byte) 0xE0 -> (byte) 0xA0 <= b && b <= (byte) 0xBF;
+      case (byte) 0xED -> (byte) 0x80 <= b && b <= (byte) 0x9F;
+      case (byte) 0xF0 -> (byte) 0x90 <= b && b <= (byte) 0xBF;
+      case (byte) 0xF4 -> (byte) 0x80 <= b && b <= (byte) 0x8F;
+      default -> isValidContinuationByte(b);
+    };
+  }
+
+  private static final byte[] UNICODE_REPLACEMENT_CHARACTER =
+    new byte[] { (byte) 0xEF, (byte) 0xBF, (byte) 0xBD };
+
+  private static void appendReplacementCharacter(ArrayList<Byte> bytes) {
+    for (byte b : UTF8String.UNICODE_REPLACEMENT_CHARACTER) bytes.add(b);
+  }
+
+  /**
+   * Returns a validated version of the current UTF-8 string by replacing invalid UTF-8 sequences
+   * with the Unicode replacement character (U+FFFD), as per the rules defined in the Unicode
+   * standard. This behaviour is consistent with the behaviour of `UnicodeString` in ICU4C.
+   *
+   * @return A new UTF8String that is a valid UTF8 byte sequence.
+   */
+  public UTF8String makeValidUTF8() {
+    ArrayList<Byte> bytes = new ArrayList<>();
+    int byteIndex = 0;
+    byteIteration:
+    while (byteIndex < numBytes) {
+      // Read the first byte.
+      byte firstByte = getByte(byteIndex);
+      int expectedLen = bytesOfCodePointInUTF8[firstByte & 0xFF];
+      int codePointLen = Math.min(expectedLen, numBytes - byteIndex);
+      // 0B UTF-8 sequence (invalid first byte).
+      if (codePointLen == 0) {
+        appendReplacementCharacter(bytes);
+        byteIndex += 1;
+        continue;
+      }
+      // 1B UTF-8 sequence (ASCII or invalid).
+      if (codePointLen == 1) {
+        if (firstByte >= 0) bytes.add(firstByte);
+        else appendReplacementCharacter(bytes);
+        byteIndex += 1;
+        continue;
+      }
+      // Read the second byte.
+      byte secondByte = getByte(byteIndex + 1);
+      if (!isValidSecondByte(secondByte, firstByte)) {
+        appendReplacementCharacter(bytes);
+        byteIndex += 1;
+        continue;
+      }
+      // Read remaining continuation bytes.
+      int continuationBytes = 2;
+      for (; continuationBytes < codePointLen; ++continuationBytes) {
+        byte nextByte = getByte(byteIndex + continuationBytes);
+        if (!isValidContinuationByte(nextByte)) {
+          break;
+        }
+      }
+      // Invalid UTF-8 sequence (not enough continuation bytes).
+      if (continuationBytes < expectedLen) {
+        appendReplacementCharacter(bytes);
+        byteIndex += continuationBytes;
+        continue;
+      }
+      // Valid UTF-8 sequence.
+      for (int i = 0; i < codePointLen; ++i) {
+        bytes.add(getByte(byteIndex + i));
+      }
+      byteIndex += codePointLen;
+    }
+    return UTF8String.fromBytes(bytes);
+  }
+
+  /**
+   * Checks if the current UTF8String is valid.
+   *
+   * @return If string represents a valid UTF8 byte sequence.
+   */
+  public boolean isValidUTF8() {
+    return makeValidUTF8().equals(this);
+  }
+
+  /**
+   * Returns the current string if it is a valid UTF8 byte sequence, otherwise throws an exception.
+   *
+   * @return The UTF8String itself if it is a valid UTF8 byte sequence.
+   */
+  public UTF8String validateUTF8() {
+    if (!isValidUTF8()) {
+      throw new IllegalArgumentException("Invalid UTF-8 string");

Review Comment:
   Indeed, the intent was to use this directly for a new Spark expression `ValidateUTF8` and delegate execution to `UTF8String`.
   
   However, `isValidUTF8()` should be sufficient as well, and we can implement the rest in Scala.
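
   To illustrate the split suggested above, here is a minimal standalone sketch in which only a boolean validity check is exposed and the throwing behaviour is layered on top by the caller. The class `Utf8ValidationSketch` and its methods are hypothetical stand-ins operating on plain `byte[]`, not Spark code, and the check uses the JDK's strict `CharsetDecoder` rather than the byte-level logic in this PR:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names are assumptions, not part of the PR.
final class Utf8ValidationSketch {

  // Boolean check: true if the bytes decode as well-formed UTF-8.
  // The decoder is configured to report errors instead of replacing them.
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  // Throwing variant built by the caller on top of the boolean check,
  // mirroring the message used in the diff above.
  static byte[] validateUtf8(byte[] bytes) {
    if (!isValidUtf8(bytes)) {
      throw new IllegalArgumentException("Invalid UTF-8 string");
    }
    return bytes;
  }

  public static void main(String[] args) {
    byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    byte[] invalid = { (byte) 0x61, (byte) 0xFF, (byte) 0x62 };
    System.out.println(isValidUtf8(valid));
    System.out.println(isValidUtf8(invalid));
  }
}
```

   With this shape, the string type only needs the boolean method, and the expression layer decides what to do with an invalid input (throw, return null, or repair it).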




