[
https://issues.apache.org/jira/browse/HIVE-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor resolved HIVE-26639.
---------------------------------
Resolution: Fixed
> ConstantVectorExpression and ExplainTask shouldn't rely on default charset
> --------------------------------------------------------------------------
>
> Key: HIVE-26639
> URL: https://issues.apache.org/jira/browse/HIVE-26639
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> In HS2 (and other components) we rely on UTF-8 encoding, so when we store
> strings as bytes, we store the UTF-8-encoded bytes. Some Java APIs fall back
> to the default system charset in various ways, which can produce incorrectly
> encoded bytes if the system default is anything other than UTF-8. This patch
> intends to fix 2 different paths:
> 1. ConstantVectorExpression
> in my case, this:
> {code}
> LOG.info("default charset name: " + java.nio.charset.Charset.defaultCharset().name());
> LOG.info("getBytes() = " + ((String) constantValue).getBytes());
> LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) constantValue).getBytes(StandardCharsets.UTF_8));
> {code}
> led to:
> {code}
> default charset name: US-ASCII
> getBytes() = [B@73dcffb0
> getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
> {code}
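The log output above only shows array identities, not the encoded bytes. As a minimal sketch of what differs between the two calls, the following passes `StandardCharsets.US_ASCII` explicitly to mimic a JVM whose default charset is US-ASCII. The class name `CharsetBytesSketch` and the helper `encode` are hypothetical and not part of the patch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetBytesSketch {
  // U+5317 U+4EAC, written as Unicode escapes so the source file's own encoding cannot matter.
  static final String SAMPLE = "\u5317\u4EAC";

  // Encodes with an explicit charset; passing US_ASCII stands in for a US-ASCII default charset.
  static byte[] encode(String value, Charset charset) {
    return value.getBytes(charset);
  }

  public static void main(String[] args) {
    // String.getBytes(Charset) replaces unmappable characters with the charset's replacement byte.
    System.out.println("US-ASCII: " + Arrays.toString(encode(SAMPLE, StandardCharsets.US_ASCII)));
    System.out.println("UTF-8:    " + Arrays.toString(encode(SAMPLE, StandardCharsets.UTF_8)));
  }
}
```

With US-ASCII, each of the two characters collapses to a single 0x3F byte ('?'), while UTF-8 yields the six bytes E5 8C 97 E4 BA AC.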
> On the customer side, queries returned wrong results when the filter
> contained a non-ASCII character (one that UTF-8 can encode):
> {code}
> SELECT b FROM default.rlv_test1 where b='北京';
> ....
> ??
> {code}
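A rough illustration of why such a filter can stop matching, assuming the comparison is done byte-wise against UTF-8-encoded column data. `ConstantFilterSketch` and `matches` are hypothetical names for this sketch and do not reflect the actual ConstantVectorExpression code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ConstantFilterSketch {
  // Hypothetical stand-in for a string equality filter that compares raw bytes.
  static boolean matches(byte[] storedUtf8, String constant, Charset constantCharset) {
    return Arrays.equals(storedUtf8, constant.getBytes(constantCharset));
  }

  public static void main(String[] args) {
    String beijing = "\u5317\u4EAC"; // the filter value from the query, as Unicode escapes
    byte[] stored = beijing.getBytes(StandardCharsets.UTF_8); // column data stored as UTF-8

    // Constant encoded with a US-ASCII "default": it becomes "??" and no longer matches.
    System.out.println(matches(stored, beijing, StandardCharsets.US_ASCII)); // false
    // Constant encoded explicitly as UTF-8: the bytes agree with the stored value.
    System.out.println(matches(stored, beijing, StandardCharsets.UTF_8));    // true
  }
}
```

Encoding the constant with an explicit `StandardCharsets.UTF_8` makes the result independent of the JVM's default charset.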
> 2. Explain
> Similarly, EXPLAIN output was written to a PrintStream that used a different
> encoding, leading to a plan like:
> {code}
> Map Operator Tree:
>     TableScan
>       alias: test_table
>       filterExpr: (b = '??') (type: boolean)
>       Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
>       Filter Operator
>         predicate: (b = '??') (type: boolean)
>         Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
>         Select Operator
>           expressions: a (type: int), '??' (type: string), c (type: string)
> {code}
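The PrintStream effect can be reproduced in isolation. The sketch below is not the ExplainTask code; `ExplainStreamSketch` and `render` are hypothetical names. It writes one plan line through a `PrintStream` constructed with a named encoding and then reads the resulting bytes back as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplainStreamSketch {
  // Writes the line through a PrintStream using the given encoding, then decodes the bytes as UTF-8.
  static String render(String line, String encoding) throws UnsupportedEncodingException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (PrintStream out = new PrintStream(sink, true, encoding)) {
      out.print(line);
    }
    return sink.toString("UTF-8");
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    String predicate = "predicate: (b = '\u5317\u4EAC') (type: boolean)";
    // A US-ASCII stream cannot represent the two characters and substitutes '?'.
    System.out.println(render(predicate, "US-ASCII"));
    // A UTF-8 stream preserves them, whatever the JVM default charset is.
    System.out.println(render(predicate, "UTF-8").equals(predicate)); // true
  }
}
```

Constructing the stream with an explicit encoding, rather than the single-argument `PrintStream(OutputStream)` constructor, avoids depending on the default charset.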