Copilot commented on code in PR #4133:
URL: https://github.com/apache/gobblin/pull/4133#discussion_r2314368875


##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);

Review Comment:
   `DataOutput.writeBytes()` only writes the low 8 bits of each character, so any character outside the ASCII range (0-127) is corrupted. This breaks the Unicode support that the previous UTF-8 encoding provided.
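   A standalone snippet (illustrative, not from the PR) makes the truncation visible:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteBytesTruncationDemo {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    new DataOutputStream(bos).writeBytes("é");  // U+00E9

    // writeBytes() kept only the low 8 bits of the char: prints [-23] (0xE9)
    System.out.println(Arrays.toString(bos.toByteArray()));

    // The correct UTF-8 encoding is two bytes: prints [-61, -87] (0xC3 0xA9)
    System.out.println(Arrays.toString("é".getBytes(StandardCharsets.UTF_8)));
  }
}
```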



##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);

Review Comment:
   Using `str.length()` as the length prefix will cause deserialization errors for any string containing multi-byte UTF-8 characters. The prefix must be the number of encoded bytes, not the number of chars, and the two counts diverge whenever a character needs more than one UTF-8 byte (e.g. `"café".length()` is 4, but its UTF-8 encoding is 5 bytes).
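   Restoring the removed UTF-8 handling fixes both the length prefix and the encoding; a sketch inside `TextSerializer`, using the class's existing `writeVLong` helper:

```java
public static void writeStringAsText(DataOutput stream, String str) throws IOException {
  byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
  writeVLong(stream, utf8Encoded.length);  // prefix with the UTF-8 byte count
  stream.write(utf8Encoded);               // then the raw UTF-8 bytes
}
```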



##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);
   }
 
   /**
    * Deserialize a Hadoop Text object into a String
    */
   public static String readTextAsString(DataInput in) throws IOException {
-    int bufLen = (int)readVLong(in);
-    byte[] buf = new byte[bufLen];
-    in.readFully(buf);
+    int bufLen = (int) readVLong(in);
+    StringBuilder sb = new StringBuilder();
 
-    return new String(buf, StandardCharsets.UTF_8);
+    for (int i = 0; i < bufLen; i++) {
+      sb.append((char) in.readByte());

Review Comment:
   Casting each byte directly to `char` produces incorrect results for multi-byte UTF-8 sequences. This approach only decodes ASCII bytes (0-127) correctly and will corrupt any other Unicode text.
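   The removed byte-based decode handled this correctly; a sketch of the read path along those lines:

```java
public static String readTextAsString(DataInput in) throws IOException {
  int bufLen = (int) readVLong(in);  // the vlong prefix is a byte count
  byte[] buf = new byte[bufLen];
  in.readFully(buf);                 // read exactly bufLen UTF-8 bytes
  return new String(buf, StandardCharsets.UTF_8);
}
```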



##########
gobblin-api/src/main/java/org/apache/gobblin/compat/hadoop/TextSerializer.java:
##########
@@ -31,20 +30,21 @@ public class TextSerializer {
    * Serialize a String using the same logic as a Hadoop Text object
    */
   public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-    byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-    writeVLong(stream, utf8Encoded.length);
-    stream.write(utf8Encoded);
+    writeVLong(stream, str.length());
+    stream.writeBytes(str);
   }

Review Comment:
   The method name suggests Hadoop Text compatibility, but the implementation is no longer compatible with Hadoop's Text serialization format, which writes a vint byte length followed by UTF-8 bytes. This could break interoperability with Hadoop systems.
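   A round-trip check against the real `org.apache.hadoop.io.Text` would catch regressions like this; a hedged test sketch (assumes `hadoop-common` on the test classpath; the class name and sample string are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.gobblin.compat.hadoop.TextSerializer;
import org.apache.hadoop.io.Text;

public class TextWireCompatCheck {
  public static void main(String[] args) throws Exception {
    String original = "naïve 日本語";

    // Serialize with the compat class under review.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    TextSerializer.writeStringAsText(new DataOutputStream(bos), original);

    // Deserialize with Hadoop's own Text to verify wire compatibility.
    Text text = new Text();
    text.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));

    if (!original.equals(text.toString())) {
      throw new AssertionError("TextSerializer output is not Hadoop Text compatible");
    }
  }
}
```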


