Re: string bucketing compatibility issue

Ryan Blue Mon, 19 Jul 2021 16:22:14 -0700

Thanks, Piotr! I've added this to the agenda for our next sync. I think the
main question is whether we think users have hit this problem or not. If we
don't think that it is something users have probably hit, then we can just
fix the problem. If we think someone has data stored with the incorrect
bucket values, we'll have to update Iceberg to account for it.


It would be great to hear from other people on this list that may be able
to comment about how many codepoints are affected and whether those are
expected to be used in these cases. For example, a handful of emoji that
probably aren't used in bucketing is not a big concern, but if certain
languages are heavily affected then this is likely a problem we should
address.

On Mon, Jul 19, 2021 at 2:37 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> I've filed https://github.com/apache/iceberg/issues/2837 for this as well.
>
> Best
> PF
>
>
>
> On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> It was discovered by @Mateusz Gajewski
>> <mateusz.gajew...@starburstdata.com> that Iceberg bucketing
>> transformation for string isn't regular Murmur3 32-bit hash.
>>
>> Upon closer investigation we found out that the code:
>>
>>
>> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>>
>> is affected by Guava issue https://github.com/google/guava/issues/5648
>> that causes wrong results for input containing surrogate pairs (Unicode
>> codepooints outside of Basic Multilingual Plane).
>>
>> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
>> the proposed fix), this can cause incorrect query results, since bucketing
>> function definition will effectively change.
>>
>> This is mostly FYI, unless we can do something more about it.
>>
>> Best
>> PF
>>
>>

-- 
Ryan Blue
Tabular

Re: string bucketing compatibility issue

Reply via email to