Re: [PR] HIVE-28728: fix STR_TO_MAP() returning garbled utf-8 characters when vectorization is enabled [hive]

via GitHub Tue, 23 Jun 2026 17:44:48 -0700


Copilot commented on code in PR #6559:
URL: https://github.com/apache/hive/pull/6559#discussion_r3463816230



##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorAssignRow.java:
##########
@@ -21,6 +21,7 @@
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
+import java.nio.charset.StandardCharsets;
 

Review Comment:
   Java imports are out of order: `StandardCharsets` should be grouped with 
other `java.*` imports (typically before `java.util.*`) to match the ordering 
used elsewhere in this package and avoid checkstyle churn.



##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorAssignRow.java:
##########
@@ -465,7 +466,7 @@ private void assignRowColumn(
           {
             if (object instanceof String) {
               String string = (String) object;
-              byte[] bytes = string.getBytes();
+              byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
               ((BytesColumnVector) columnVector).setVal(
                   batchIndex, bytes, 0, bytes.length);

Review Comment:
   This change fixes `STRING` by forcing UTF-8 bytes, but the same method still 
uses platform-default `String#getBytes()` for `VARCHAR`/`CHAR` (and the 
analogous code path later in this class). With a non-UTF-8 default charset 
those branches can still emit garbled UTF-8; please switch those conversions to 
`getBytes(StandardCharsets.UTF_8)` as well (e.g., current 
`HiveVarchar.getValue().getBytes()` / `HiveChar.getStrippedValue().getBytes()`).



##########
ql/src/test/queries/clientpositive/str_to_map_utf8.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-28728: STR_TO_MAP() must preserve UTF-8 in vectorized execution when 
JVM default charset is not UTF-8.
+-- Tez container opts below force US-ASCII in Tez tasks
+-- Use driver-level mimic for testing with llap: 
-Dmaven.test.jvm.args="-Dfile.encoding=US-ASCII"
+
+SET tez.am.launch.cmd-opts=-Dfile.encoding=US-ASCII;
+SET hive.tez.java.opts=-Dfile.encoding=US-ASCII;
+SET hive.vectorized.execution.enabled=true;
+SET hive.fetch.task.conversion=none;
+
+CREATE TABLE hive28728_src (id string, name string, multi string) STORED AS 
ORC;
+INSERT INTO hive28728_src VALUES
+  ('100','hive', 'en:1'),

Review Comment:
   This test creates several persistent tables (`hive28728_src`, 
`hive28728_result`, `hive28728_multi`, `hive28728_result_novec`) but never 
drops them. Since the suite cleanup scripts don’t remove these tables, reruns 
(or other tests that might reuse names) can fail on `CREATE TABLE`; please add 
`DROP TABLE IF EXISTS ...` before creation and/or drop the tables at the end 
(and update the expected `.q.out` files accordingly).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-28728: fix STR_TO_MAP() returning garbled utf-8 characters when vectorization is enabled [hive]

Reply via email to