Copilot commented on code in PR #4133:
URL: https://github.com/apache/gobblin/pull/4133#discussion_r2314368875
##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);

Review Comment:
   `DataOutput.writeBytes()` only writes the low 8 bits of each character, which will corrupt any character outside the ASCII range (0-127). This breaks the Unicode support that the previous UTF-8 encoding provided.

##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);

Review Comment:
   Using `str.length()` for the length prefix will cause deserialization errors for strings containing multi-byte UTF-8 characters: the prefix must be the number of encoded bytes, not the number of characters, and the two differ whenever a character needs more than one byte.

##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);
   }
 
   /**
    * Deserialize a Hadoop Text object into a String
    */
   public static String readTextAsString(DataInput in) throws IOException {
-    int bufLen = (int)readVLong(in);
-    byte[] buf = new byte[bufLen];
-    in.readFully(buf);
+    int bufLen = (int) readVLong(in);
+    StringBuilder sb = new StringBuilder();
 
-    return new String(buf, StandardCharsets.UTF_8);
+    for (int i = 0; i < bufLen; i++) {
+      sb.append((char) in.readByte());

Review Comment:
   Casting a byte directly to char produces incorrect results for multi-byte UTF-8 characters. This approach only works for ASCII (0-127) and will corrupt Unicode text.

##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);
   }

Review Comment:
   The method name promises Hadoop Text compatibility, but the implementation no longer matches Hadoop's Text serialization format, which uses a UTF-8 byte encoding. This could break interoperability with Hadoop systems.
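For reference, a minimal sketch of the two methods that stays byte-compatible with Hadoop's Text wire format (a vint prefix counting UTF-8 bytes, then the raw bytes); this is essentially the logic the diff removes. It assumes the class's existing `writeVLong`/`readVLong` vint helpers visible in the hunks above and is a sketch, not the authoritative fix.

```java
// Imports the class already relies on for this logic.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Serialize a String using the same logic as a Hadoop Text object. */
public static void writeStringAsText(DataOutput stream, String str) throws IOException {
  // Encode first so the length prefix counts bytes, not characters;
  // str.length() diverges from the byte count for any non-ASCII character.
  byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
  writeVLong(stream, utf8Encoded.length);
  // write() emits every bit of each byte; DataOutput.writeBytes(String)
  // keeps only the low 8 bits of each char and corrupts non-ASCII text.
  stream.write(utf8Encoded);
}

/** Deserialize a Hadoop Text payload into a String. */
public static String readTextAsString(DataInput in) throws IOException {
  int bufLen = (int) readVLong(in);
  byte[] buf = new byte[bufLen];
  in.readFully(buf);
  // Decode the buffer as a whole in UTF-8; per-byte (char) casts split
  // multi-byte sequences and mangle anything outside ASCII.
  return new String(buf, StandardCharsets.UTF_8);
}
```

A quick round trip through `DataOutputStream`/`DataInputStream` with a non-ASCII string such as "héllo 世界" makes the regression easy to verify: the UTF-8 version decodes back to the original, while the char-based version does not.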
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org