Thanks, Piotr! I've added this to the agenda for our next sync. I think the main question is whether we think users have hit this problem or not. If we don't think that it is something users have probably hit, then we can just fix the problem. If we think someone has data stored with the incorrect bucket values, we'll have to update Iceberg to account for it.
It would be great to hear from other people on this list that may be able to comment about how many codepoints are affected and whether those are expected to be used in these cases. For example, a handful of emoji that probably aren't used in bucketing is not a big concern, but if certain languages are heavily affected then this is likely a problem we should address. On Mon, Jul 19, 2021 at 2:37 AM Piotr Findeisen <pi...@starburstdata.com> wrote: > Hi, > > I've filed https://github.com/apache/iceberg/issues/2837 for this as well. > > Best > PF > > > > On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com> > wrote: > >> Hi, >> >> It was discovered by @Mateusz Gajewski >> <mateusz.gajew...@starburstdata.com> that Iceberg bucketing >> transformation for string isn't regular Murmur3 32-bit hash. >> >> Upon closer investigation we found out that the code: >> >> >> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239 >> >> is affected by Guava issue https://github.com/google/guava/issues/5648 >> that causes wrong results for input containing surrogate pairs (Unicode >> codepooints outside of Basic Multilingual Plane). >> >> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with >> the proposed fix), this can cause incorrect query results, since bucketing >> function definition will effectively change. >> >> This is mostly FYI, unless we can do something more about it. >> >> Best >> PF >> >> -- Ryan Blue Tabular