findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-883140345

> If this only affects code points like 💰 then I'm not sure that we need to add compatibility. But if this affects normal use in character-based languages then we should build and document a fix

Emojis are probably the most common non-BMP symbols, but a quick search found https://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use where someone mentioned symbols like "𨭎", "𠬠", and "𩷶" (these are probably not regular words?), mathematical symbols and other things.
I would _assume_ values containing arbitrary inputs are never bucketed on though.

> but we need to create and(eq("col_bucket", 4), eq("col_bucket", 12)) instead to pick up data incorrectly placed in bucket 4

That's easy to do today -- e.g. just call two different Guava APIs for Murmur3_32 to get numbers 4 and 12.
Once Guava is changed -- or for any application not using Guava -- it would require copying the affected/buggy hashing algorithm and making it part of the fix.
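
To make "non-BMP" concrete, here is a minimal sketch that uses only the JDK to check whether a value contains any code point above U+FFFF, which are the values the discussion above is concerned with. The class name `NonBmpCheck` and its methods are hypothetical and are not part of Iceberg or Guava, and the sketch does not reproduce either hashing behaviour; it only shows how such values could be identified.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an Iceberg or Guava API.
final class NonBmpCheck {

    // True if the value contains at least one supplementary (non-BMP) code point,
    // i.e. one that Java stores as a UTF-16 surrogate pair.
    static boolean containsNonBmp(String value) {
        return value.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Number of bytes in the UTF-8 encoding of the value.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String money = new String(Character.toChars(0x1F4B0)); // the emoji quoted above
        String cjk = new String(Character.toChars(0x29DF6));   // a CJK Extension B character
        String[] samples = {"iceberg", money, "a" + cjk};

        for (String s : samples) {
            System.out.printf(
                "%s: chars=%d codePoints=%d utf8Bytes=%d nonBmp=%b%n",
                s,
                s.length(),
                s.codePointCount(0, s.length()),
                utf8Length(s),
                containsNonBmp(s));
        }
    }
}
```

A non-BMP code point occupies two `char`s in a Java `String` but a single four-byte sequence in UTF-8, so a hash that walks the string `char` by `char` has to combine each surrogate pair before encoding it in order to agree with a hash over the UTF-8 bytes.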
