[email protected] has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/21131 )

Change subject: IMPALA-11499: Refactor UrlEncode function to handle special 
characters
......................................................................

IMPALA-11499: Refactor UrlEncode function to handle special characters

An error came from an issue with URL encoding, where certain Unicode
characters were being incorrectly encoded due to their UTF-8
representation matching characters in the set of characters to escape.
For example, the string '运', which consists of three bytes
0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFFFFFBF\90',
because the middle byte matched one of the two bytes that
represented the "\u00FF" literal. Inclusion of "\u00FF" was likely
a mistake from the beginning and it should have been '\x7F'.

The patch addresses two key changes:
1. Before the change, there was a "set" of characters that needed
to be encoded, which were stored as a string. The current patch
uses an unordered_set which contains all the special characters
that needs to be escaped to handle the character encoding
efficiently.

2. '\xFF', which is an invalid UTF-8 byte whose inclusion was erroneous
from the beginning, is replaced with '\x7F', which is a control
character for DELETE, ensuring consistency and correctness in URL
encoding.

Testing: Tests on both traditional Hive tables and Iceberg tables
are included in unicode-column-name.test, insert.test and
coding-util-test.cc.

Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
---
M be/src/util/coding-util-test.cc
M be/src/util/coding-util.cc
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test
4 files changed, 177 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/13
--
To view, visit http://gerrit.cloudera.org:8080/21131
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Gerrit-Change-Number: 21131
Gerrit-PatchSet: 13
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>

Reply via email to