[ 
https://issues.apache.org/jira/browse/IMPALA-11954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702789#comment-17702789
 ] 

Gabor Kaszab commented on IMPALA-11954:
---------------------------------------

Apparently, it's not just the "/" char that is affected but every character that is 
encoded by UrlEncode here:
[https://github.com/apache/impala/blob/a6333aed6b0fe2cf355aeaa1952735b9208b2f43/be/src/util/coding-util.cc#L50]

One example:
{code:java}
insert into tmp_ice values(119, '12"31"11');
select * from tmp_ice;
+-----+-----------------+
| i   | date_string_col |
+-----+-----------------+
| 119 | 12%2231%2211    |
+-----+-----------------+
{code}

The root cause of this issue seems to be here:
https://github.com/apache/impala/blob/a6333aed6b0fe2cf355aeaa1952735b9208b2f43/be/src/exec/hdfs-table-sink.cc#L562
The hdfs-table-sink uses the encoded partition name as the partition name, not just 
for the file names. Later on, for Iceberg tables, this encoded partition name is 
passed to the UpdateCatalog operation and is saved to the table 
metadata as the partition name instead of the non-encoded version.
One thing I have to investigate is how Hive tables would behave if I changed 
the code here for the partition name to hold the non-encoded value.
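
To illustrate the suspected mechanism, here is a minimal sketch in Python rather than Impala's C++. It assumes that `urllib.parse.quote` with no safe characters is a reasonable stand-in for UrlEncode (the exact set of escaped characters in Impala may differ). `encode_partition_value` and `build_partition_info` are hypothetical names used only for this illustration. The sketch keeps the encoded form for the file path and the raw form for the metadata:

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    # Stand-in for Impala's UrlEncode (assumption): percent-encode every
    # character outside the unreserved set, including '/' and '"'.
    return quote(value, safe="")


def build_partition_info(col: str, value: str) -> dict:
    # Hypothetical structure: the encoded value is used only to build the
    # directory name, while the raw value is what would go to the catalog.
    return {
        "path_name": f"{col}={encode_partition_value(value)}",
        "partition_value": value,
    }


info = build_partition_info("date_string_col", '12"31"11')
print(info["path_name"])        # date_string_col=12%2231%2211
print(info["partition_value"])  # 12"31"11

# Decoding the path component recovers the original value.
assert unquote(encode_partition_value("09/01/09")) == "09/01/09"
```

If only the encoded `path_name` is carried forward, the value stored in the table metadata is `12%2231%2211`, which matches the select output above.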

> Partition an Iceberg table on a string col with '/' char gives incorrect 
> results
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-11954
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11954
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0.0
>            Reporter: Gabor Kaszab
>            Assignee: Gabor Kaszab
>            Priority: Blocker
>              Labels: correctness, impala-iceberg
>
> Repro:
> {code:java}
> CREATE TABLE IF NOT EXISTS tmp_ice
> (id int, date_string_col string)
> PARTITIONED BY SPEC (date_string_col)
> STORED AS ICEBERG;
> insert into tmp_ice select id, date_string_col from 
> functional_parquet.alltypes;
> select * from tmp_ice where date_string_col = "09/01/09";
> {code}
> This select gives zero rows.
> However, if I create the table partitioned by another col, e.g. 'id', then the 
> very same select gives 10 rows as expected.
> The issue may be somewhere here where we split the path by '/' char:
> https://github.com/apache/impala/blob/47c71bbb32d34d4583856af227206934b6f15136/fe/src/main/java/org/apache/impala/util/IcebergUtil.java#L693



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

