Hi,

@Mateusz Gajewski <mateusz.gajew...@starburstdata.com> discovered that the
Iceberg bucketing transformation for strings isn't a regular Murmur3 32-bit
hash.

Upon closer investigation we found out that the code:

https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239

is affected by Guava issue https://github.com/google/guava/issues/5648, which
causes wrong results for input containing surrogate pairs (Unicode
code points outside of the Basic Multilingual Plane).
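
For reference, here is a minimal sketch of what a "regular" Murmur3 32-bit
hash of a string would look like. It is written in plain Python so it has no
Guava dependency, and it is not the Iceberg or Guava code. It assumes the
intended definition is the x86 32-bit variant with seed 0 applied to the
UTF-8 bytes of the string, which is my reading of the Iceberg spec. The names
murmur3_32, hash_string and has_supplementary are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def _mix_k(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash state."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Murmur3 x86 32-bit hash of raw bytes, returned as an unsigned int."""
    h = seed & MASK32
    n = len(data)
    full = n - n % 4

    # Body: consume the input as little-endian 4-byte blocks.
    for i in range(0, full, 4):
        h ^= _mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    if full < n:
        h ^= _mix_k(int.from_bytes(data[full:], "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_string(s: str) -> int:
    """Assumed definition: hash the UTF-8 encoding of the string, seed 0."""
    return murmur3_32(s.encode("utf-8"))


def has_supplementary(s: str) -> bool:
    """True if s has a code point outside the BMP (a surrogate pair in UTF-16)."""
    return any(ord(ch) > 0xFFFF for ch in s)


if __name__ == "__main__":
    emoji = "\U0001F600"  # one code point outside the BMP
    print(len(emoji.encode("utf-8")))           # 4 bytes in UTF-8
    print(len(emoji.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
    print(has_supplementary("iceberg"), has_supplementary(emoji))
    print(hash_string("iceberg"), hash_string(emoji))
```

Only strings for which has_supplementary returns True are in the affected
category described above. For BMP-only input, a result that differs from this
sketch would point to a different problem.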

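Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
the proposed fix), picking up the fix can cause incorrect query results, since
the bucketing function definition will effectively change: values containing
surrogate pairs would hash to different buckets than the ones already
recorded for existing data.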

This is mostly FYI, unless we can do something more about it.

Best
PF
