Zoltán Borók-Nagy created IMPALA-14229:
------------------------------------------

             Summary: Iceberg STRINGs are not UTF-8 aware in Impala
                 Key: IMPALA-14229
                 URL: https://issues.apache.org/jira/browse/IMPALA-14229
             Project: IMPALA
          Issue Type: Bug
            Reporter: Zoltán Borók-Nagy


The Iceberg spec states that STRINGs are UTF-8 encoded 
[https://iceberg.apache.org/spec/#primitive-types]

However, Impala still mostly treats them as raw byte arrays. As a result, some 
statements fail outright:
{noformat}
create table ice_str (s string)
partitioned by spec(truncate(2, s))
stored by iceberg;

> insert into ice_str values ('tüüü');
2025-07-15 17:25:39 [Exception]  ERROR: Query f04bccbb7613822c:12d814bf00000000 
failed:
RuntimeException: java.nio.charset.MalformedInputException: Input length = 1
CAUSED BY: MalformedInputException: Input length = 1
{noformat}
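A minimal sketch of one plausible cause of this failure, assuming the truncate transform cuts the string at a byte offset rather than at a code point boundary. The helper names (truncate_bytes, truncate_chars) are hypothetical and only model the two behaviours; they are not Impala or Iceberg code. In UTF-8, 'ü' occupies two bytes, so keeping the first two bytes of 'tüüü' leaves 't' followed by half a character, which a strict decoder rejects:

```python
# Hypothetical helpers: they model byte-wise vs. code-point-wise truncation
# and are NOT taken from Impala or Iceberg.

def truncate_bytes(s: str, width: int) -> bytes:
    """Keep the first `width` bytes of the UTF-8 encoding (raw byte-array view)."""
    return s.encode("utf-8")[:width]


def truncate_chars(s: str, width: int) -> str:
    """Keep the first `width` Unicode code points (UTF-8 aware view)."""
    return s[:width]


value = "t\u00fc\u00fc\u00fc"  # 'tüüü', written with escapes to avoid encoding ambiguity

raw = truncate_bytes(value, 2)  # 't' plus the first byte of the two-byte 'ü'
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # Python's counterpart of Java's MalformedInputException
    decoded_ok = False

print(raw, decoded_ok)
print(truncate_chars(value, 2))
```

Truncating by code points instead yields a valid two-character prefix, so no decoding error can occur.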
Other statements succeed but produce incorrect results:
{noformat}
> insert into ice_str values ('üüü');

> show files in ice_str
hdfs://localhost:20500/test-warehouse/ice_str/data/s_trunc=ü/xxx_data.0.parq 
<== incorrect partition, Hive also URL-encodes the UTF-8 characters

> select s, length(s) from ice_str;
+-----+-----------+
| s   | length(s) |
+-----+-----------+
| üüü | 6         | <== length should be 3
+-----+-----------+{noformat}
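The numbers above are consistent with counting bytes instead of characters. The sketch below reproduces them under that assumption; it is an illustration, not Impala's implementation. The percent-encoding step uses Python's urllib.parse.quote as a stand-in, and the exact escaping rules Hive applies to partition paths may differ:

```python
from urllib.parse import quote

value = "\u00fc\u00fc\u00fc"  # 'üüü', each 'ü' is two bytes in UTF-8

byte_length = len(value.encode("utf-8"))  # what length(s) reports above
char_length = len(value)                  # what a UTF-8 aware length would report

# Byte-wise truncate(2, s) happens to land on a character boundary here,
# so it decodes cleanly but keeps only one character.
byte_prefix = value.encode("utf-8")[:2].decode("utf-8")
# Code-point-wise truncate(2, s) keeps two characters.
char_prefix = value[:2]

# Illustrative percent-encoding of the UTF-8 aware partition value.
encoded_partition = quote(char_prefix)

print(byte_length, char_length)
print(byte_prefix, char_prefix, encoded_partition)
```

Under these assumptions the byte-wise prefix matches the observed s_trunc=ü directory, while a UTF-8 aware truncation would give 'üü'.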



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
