[
https://issues.apache.org/jira/browse/HIVE-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor updated HIVE-26639:
--------------------------------
Description:
In HS2 (and other components) we rely on UTF-8 encoding, so when storing
strings as bytes, we store the UTF-8-encoded bytes. Some Java APIs rely on the
default system encoding in different ways, which can lead to incorrect encoding
(if the system default is anything other than UTF-8). This patch intends to fix
2 different paths:
1. ConstantVectorExpression
in my case, this:
{code}
LOG.info("default charset name: " + java.nio.charset.Charset.defaultCharset().name());
LOG.info("getBytes() = " + ((String) constantValue).getBytes());
LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) constantValue).getBytes(StandardCharsets.UTF_8));
{code}
led to:
{code}
default charset name: US-ASCII
getBytes() = [B@73dcffb0
getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
{code}
On the customer side, queries returned wrong results when the filter contained
non-ASCII characters (which UTF-8 can represent, but US-ASCII cannot):
{code}
SELECT b FROM default.rlv_test1 where b='北京';
....
??
{code}
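The log lines above only print array identities, so the difference between the two byte arrays is not visible. A minimal sketch that makes it visible, assuming a standalone class (`CharsetBytesDemo` and `encode` are hypothetical names, not Hive code) and passing US-ASCII explicitly to stand in for a non-UTF-8 default charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetBytesDemo {
    // Hypothetical helper: encodes with an explicitly chosen charset
    // instead of whatever the JVM default happens to be.
    static byte[] encode(String value, Charset charset) {
        return value.getBytes(charset);
    }

    public static void main(String[] args) {
        // The filter constant from the query, written with Unicode escapes
        // so the example does not depend on the source file encoding.
        String constant = "\u5317\u4EAC";

        // US-ASCII cannot represent these characters; each becomes '?' (0x3F).
        byte[] ascii = encode(constant, StandardCharsets.US_ASCII);
        // UTF-8 encodes each of the two characters as three bytes.
        byte[] utf8 = encode(constant, StandardCharsets.UTF_8);

        System.out.println("US-ASCII: " + Arrays.toString(ascii));
        System.out.println("UTF-8:    " + Arrays.toString(utf8));
    }
}
```

With US-ASCII the constant collapses to two `?` bytes, which would explain both the `??` in the result and the failed comparison against the UTF-8 bytes stored for the column.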
2. Explain
Similarly, the EXPLAIN output was printed to a PrintStream that used a
different encoding, leading to a plan like:
{code}
Map Operator Tree:
    TableScan
      alias: rlv_test1
      filterExpr: (b = '??') (type: boolean)
      Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
      Filter Operator
        predicate: (b = '??') (type: boolean)
        Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
        Select Operator
          expressions: a (type: int), '??' (type: string), c (type: string)
{code}
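The PrintStream behaviour can be reproduced outside Hive. The following is only a sketch under assumptions: `ExplainStreamDemo` and `render` are hypothetical names, and US-ASCII is passed explicitly to imitate a stream created with a non-UTF-8 default; it is not the code changed by the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplainStreamDemo {
    // Hypothetical helper: writes text through a PrintStream that uses the
    // given encoding, then decodes the captured bytes as UTF-8.
    static String render(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The three-argument constructor fixes the encoding explicitly
        // rather than inheriting the JVM default.
        PrintStream out = new PrintStream(buffer, true, charsetName);
        out.print(text);
        out.flush();
        return buffer.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Same predicate as in the plan, with the constant as Unicode escapes.
        String predicate = "predicate: (b = '\u5317\u4EAC')";

        // Unmappable characters are replaced with '?' by the ASCII stream.
        System.out.println(render(predicate, "US-ASCII"));
        // The UTF-8 stream round-trips the constant unchanged.
        System.out.println(render(predicate, "UTF-8").equals(predicate));
    }
}
```

Creating the stream with an explicit UTF-8 encoding keeps the printed plan independent of the system settings.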
> ConstantVectorExpression shouldn't rely on default charset
> ----------------------------------------------------------
>
> Key: HIVE-26639
> URL: https://issues.apache.org/jira/browse/HIVE-26639
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>