[
https://issues.apache.org/jira/browse/IMPALA-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845052#comment-17845052
]
ASF subversion and git services commented on IMPALA-11499:
----------------------------------------------------------
Commit 85cd07a11e876f3d8773f2638f699c61a6b0dd4c in impala's branch
refs/heads/master from pranavyl
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85cd07a11 ]
IMPALA-11499: Refactor UrlEncode function to handle special characters
The error came from URL encoding: certain Unicode characters were
being incorrectly encoded because bytes of their UTF-8 representation
matched characters in the set of characters to escape. For example,
the string '运', which consists of the three bytes 0xe8 0xbf 0x90, was
wrongly encoded as '\xE8%FFFFFFBF\x90', because the middle byte
matched one of the two bytes representing the "\u00FF" literal.
Including "\u00FF" was likely a mistake from the beginning; it should
have been '\x7F'.
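The collision is easy to reproduce in a few lines. The sketch below is a
hypothetical reconstruction (kOldEscapeChars is an illustrative subset, not
Impala's exact pre-fix list), assuming a UTF-8 source/execution character set:

```cpp
#include <string>

// Hypothetical stand-in for the pre-fix escape set stored as a raw string.
// Under a UTF-8 execution charset, the literal "\u00FF" contributes the two
// bytes 0xC3 0xBF -- not a single 0xFF byte.
static const std::string kOldEscapeChars = "\"#%\\*/:=?\u00FF";

// Byte-wise membership test, as string::find performs it.
inline bool InOldEscapeSet(char ch) {
  return kOldEscapeChars.find(ch) != std::string::npos;
}
```

Because the middle byte of '运' (0xE8 0xBF 0x90) is 0xBF, it collides with the
trailing byte of "\u00FF", so `InOldEscapeSet('\xBF')` returns true even though
0xBF only ever appears inside multi-byte UTF-8 sequences.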
The patch makes three key changes:
1. Before the change, the set of characters that need to be escaped
was stored as a string. The current patch uses an unordered_set
instead.
2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was
erroneous from the beginning, is replaced with '\x7F', which is a
control character for DELETE, ensuring consistency and correctness in
URL encoding.
3. The list of characters to be escaped is extended to match the
current list in Hive.
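Changes 1 and 2 above can be sketched as follows. This is a minimal
illustration, not the committed code: kShouldEscape is an abbreviated
stand-in for Hive's actual list, and UrlEncodeSketch is a hypothetical helper.

```cpp
#include <cstdio>
#include <string>
#include <unordered_set>

// Escape set as an unordered_set<char>, with '\x7F' (DEL) in place of the
// invalid "\u00FF". Illustrative subset of Hive's list only.
static const std::unordered_set<char> kShouldEscape = {
    '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\x7F', '{', '[', ']'};

std::string UrlEncodeSketch(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    // Only single ASCII bytes are ever in the set, so bytes of multi-byte
    // UTF-8 sequences (>= 0x80) fall through and are preserved verbatim.
    if (kShouldEscape.count(static_cast<char>(c)) > 0) {
      char buf[4];
      // Format from the unsigned value: two hex digits, no sign extension.
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned>(c));
      out.append(buf);
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}
```

With this shape, '运' (0xE8 0xBF 0x90) passes through unchanged, while '/'
still becomes "%2F" and '\x7F' becomes "%7F".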
Testing: Tests on both traditional Hive tables and Iceberg tables
are included in unicode-column-name.test, insert.test,
coding-util-test.cc and test_insert.py.
Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Reviewed-on: http://gerrit.cloudera.org:8080/21131
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Refactor UrlEncode function to handle special characters
> --------------------------------------------------------
>
> Key: IMPALA-11499
> URL: https://issues.apache.org/jira/browse/IMPALA-11499
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Pranav Yogi Lodha
> Priority: Critical
>
> Partition values are incorrectly URL-encoded in the backend for Unicode
> characters, e.g. '运营业务数据' is encoded to '�%FFFFFFBF�营业务数据', which is wrong.
> To reproduce the issue, first create a partitioned table:
> {code:sql}
> create table my_part_tbl (id int) partitioned by (p string) stored as parquet;
> {code}
> Then insert data into it using partition values containing '运'. They will
> fail:
> {noformat}
> [localhost:21050] default> insert into my_part_tbl partition(p='运营业务数据')
> values (0);
> Query: insert into my_part_tbl partition(p='运营业务数据') values (0)
> Query submitted at: 2022-08-16 10:03:56 (Coordinator:
> http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at:
> http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=404ac3027c4b7169:39d16a2d00000000
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op
> (RENAME
> hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq
> TO
> hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq)
> failed, error was:
> hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq
> Error(5): Input/output error
> [localhost:21050] default> insert into my_part_tbl partition(p='运') values
> (0);
> Query: insert into my_part_tbl partition(p='运') values (0)
> Query submitted at: 2022-08-16 10:04:22 (Coordinator:
> http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at:
> http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=a64e5883473ec28d:86e7e33500000000
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op
> (RENAME
> hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq
> TO
> hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq)
> failed, error was:
> hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq
> Error(5): Input/output error
> {noformat}
> However, a partition value without the character '运' is OK:
> {noformat}
> [localhost:21050] default> insert into my_part_tbl partition(p='营业务数据')
> values (0);
> Query: insert into my_part_tbl partition(p='营业务数据') values (0)
> Query submitted at: 2022-08-16 10:04:13 (Coordinator:
> http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at:
> http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=b04894bfcfc3836a:b1ac903600000000
> Modified 1 row(s) in 0.21s
> {noformat}
> Hive is able to execute all these statements.
> I was able to narrow the issue down to the backend, where we URL-encode the
> partition value in HdfsTableSink::InitOutputPartition():
> {code:cpp}
>   string value_str;
>   partition_key_expr_evals_[j]->PrintValue(value, &value_str);
>   // Directory names containing partition-key values need to be UrlEncoded, in
>   // particular to avoid problems when '/' is part of the key value (which might
>   // occur, for example, with date strings). Hive will URL decode the value
>   // transparently when Impala's frontend asks the metastore for partition key
>   // values, which makes it particularly important that we use the same encoding
>   // as Hive. It's also not necessary to encode the values when writing partition
>   // metadata. You can check this with 'show partitions <tbl>' in Hive, followed
>   // by a select from a decoded partition key value.
>   string encoded_str;
>   UrlEncode(value_str, &encoded_str, true);
>   // If the string is empty, map it to nullptr (mimicking Hive's behaviour)
>   string part_key_value = (encoded_str.empty() ?
>       table_desc_->null_partition_key_value() : encoded_str);
>   partition_name_ss << part_key_value;
>   if (is_external_part) {
>     external_partition_name_ss << part_key_value;
>   }
> {code}
> https://github.com/apache/impala/blob/3f51a6a761bcb93675088afc155ccc3d2a380401/be/src/exec/hdfs-table-sink.cc#L543
> The current code of UrlEncode does not seem to handle UTF-8 encoded strings
> correctly:
> {code:cpp}
> static inline void UrlEncode(const char* in, int in_len, string* out,
>     bool hive_compat) {
>   (*out).reserve(in_len);
>   stringstream ss;
>   for (int i = 0; i < in_len; ++i) {
>     const char ch = in[i];
>     // Escape the character iff a) we are in Hive-compat mode and the
>     // character is in the Hive whitelist or b) we are not in
>     // Hive-compat mode, and the character is not alphanumeric or one
>     // of the four commonly excluded characters.
>     if ((hive_compat && HiveShouldEscape(ch)) ||
>         (!hive_compat && !(isalnum(ch) || ShouldNotEscape(ch)))) {
>       ss << '%' << uppercase << hex << static_cast<uint32_t>(ch);
>     } else {
>       ss << ch;
>     }
>   }
>   (*out) = ss.str();
> }
> {code}
> https://github.com/apache/impala/blob/3f51a6a761bcb93675088afc155ccc3d2a380401/be/src/util/coding-util.cc#L50-L68
> After adding some more logging, I can see "运营业务数据" is encoded into
> "�%FFFFFFBF�营业务数据" by this method.
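Two defects compound in the escape branch above: the byte 0xBF of '运' wrongly
matches the escape set (via the trailing byte of "\u00FF"), and
static_cast<uint32_t>(ch) sign-extends the signed char, producing eight hex
digits instead of two. A minimal sketch of the second defect, assuming char is
signed (the default on typical x86 toolchains); HexOfByte and HexOfByteFixed
are illustrative helpers, not Impala functions:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Mirrors what `ss << hex << static_cast<uint32_t>(ch)` emits: when char is
// signed, a high byte such as 0xBF sign-extends to 0xFFFFFFBF.
std::string HexOfByte(char ch) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%X", static_cast<uint32_t>(ch));
  return buf;  // "FFFFFFBF" for '\xBF' when char is signed
}

// Casting through unsigned char first yields the intended two-digit value.
std::string HexOfByteFixed(char ch) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%X",
      static_cast<uint32_t>(static_cast<unsigned char>(ch)));
  return buf;  // "BF" for '\xBF'
}
```

This is why the log shows "%FFFFFFBF" for the escaped middle byte, while the
first and third bytes of '运' (0xE8, 0x90) never matched the escape set and
passed through raw, rendering as '�'.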
--
This message was sent by Atlassian Jira
(v8.20.10#820010)