Quanlong Huang created IMPALA-11499:
---------------------------------------

             Summary: Partition values with unicode chars are not handled 
correctly
                 Key: IMPALA-11499
                 URL: https://issues.apache.org/jira/browse/IMPALA-11499
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


Partition values are incorrectly URL-encoded in backend for unicode characters, 
e.g. '运营业务数据' is encoded to '�%FFFFFFBF�营业务数据' which is wrong.

To reproduce the issue, first create a partition table:
{code:sql}
create table my_part_tbl (id int) partitioned by (p string) stored as parquet;
{code}
Then insert data into it using partition values containing '运'. They will fail:
{noformat}
[localhost:21050] default> insert into my_part_tbl partition(p='运营业务数据') values 
(0);
Query: insert into my_part_tbl partition(p='运营业务数据') values (0)
Query submitted at: 2022-08-16 10:03:56 (Coordinator: 
http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at: 
http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=404ac3027c4b7169:39d16a2d00000000
ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME 
hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq
 TO 
hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq)
 failed, error was: 
hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq
Error(5): Input/output error

[localhost:21050] default> insert into my_part_tbl partition(p='运') values (0);
Query: insert into my_part_tbl partition(p='运') values (0)
Query submitted at: 2022-08-16 10:04:22 (Coordinator: 
http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at: 
http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=a64e5883473ec28d:86e7e33500000000
ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME 
hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq
 TO 
hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq)
 failed, error was: 
hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq
Error(5): Input/output error
{noformat}
However, partition value without the character '运' is OK:
{noformat}
[localhost:21050] default> insert into my_part_tbl partition(p='营业务数据') values 
(0);
Query: insert into my_part_tbl partition(p='营业务数据') values (0)
Query submitted at: 2022-08-16 10:04:13 (Coordinator: 
http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at: 
http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=b04894bfcfc3836a:b1ac903600000000
Modified 1 row(s) in 0.21s
{noformat}

Hive is able to execute all these statements.

I'm able to narrow down the issue into Backend, where we URL-encode the 
partition value in HdfsTableSink::InitOutputPartition():
{code:cpp}
      string value_str;
      partition_key_expr_evals_[j]->PrintValue(value, &value_str);
      // Directory names containing partition-key values need to be UrlEncoded, 
in
      // particular to avoid problems when '/' is part of the key value (which 
might
      // occur, for example, with date strings). Hive will URL decode the value
      // transparently when Impala's frontend asks the metastore for partition 
key values,
      // which makes it particularly important that we use the same encoding as 
Hive. It's
      // also not necessary to encode the values when writing partition 
metadata. You can
      // check this with 'show partitions <tbl>' in Hive, followed by a select 
from a
      // decoded partition key value.
      string encoded_str;
      UrlEncode(value_str, &encoded_str, true);
string part_key_value = (encoded_str.empty() ?
                              table_desc_->null_partition_key_value() : 
encoded_str);
      // If the string is empty, map it to nullptr (mimicking Hive's behaviour)
      partition_name_ss << part_key_value;
      if (is_external_part) {
        external_partition_name_ss << part_key_value;
      }
{code}
https://github.com/apache/impala/blob/3f51a6a761bcb93675088afc155ccc3d2a380401/be/src/exec/hdfs-table-sink.cc#L543

Current code of UrlEncode seems not handling UTF-8 encoded strings correctly:
{code:cpp}
static inline void UrlEncode(const char* in, int in_len, string* out, bool 
hive_compat) {
  (*out).reserve(in_len);
  stringstream ss;
  for (int i = 0; i < in_len; ++i) {
    const char ch = in[i];
    // Escape the character iff a) we are in Hive-compat mode and the
    // character is in the Hive whitelist or b) we are not in
    // Hive-compat mode, and the character is not alphanumeric or one
    // of the four commonly excluded characters.
    if ((hive_compat && HiveShouldEscape(ch)) ||
        (!hive_compat && !(isalnum(ch) || ShouldNotEscape(ch)))) {
      ss << '%' << uppercase << hex << static_cast<uint32_t>(ch);
    } else {
      ss << ch;
    }
  }

  (*out) = ss.str();
}
{code}
https://github.com/apache/impala/blob/3f51a6a761bcb93675088afc155ccc3d2a380401/be/src/util/coding-util.cc#L50-L68



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to