findepi commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-883140345


   > If this only affects code points like 💰 then I'm not sure that we need to 
add compatibility. But if this affects normal use in character-based languages 
then we should build and document a fix
   
   Emojis are probably the most common non-BMP symbols, but a quick search found 
https://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use
 where someone mentioned symbols like "𨭎", "𠬠", and "𩷶" (these are probably not 
regular words?), mathematical symbols, and other things.
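
To check whether a given value is affected at all, it is enough to look for supplementary (non-BMP) code points. A minimal sketch using only the JDK; the class and method names are hypothetical and not part of Iceberg or Guava:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class NonBmpSketch {
    /** True if the string holds at least one code point above U+FFFF. */
    static boolean containsNonBmp(String value) {
        return value.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // U+1F4B0 (the money bag emoji), built from its code point to avoid
        // depending on the source file encoding.
        String money = new String(Character.toChars(0x1F4B0));
        // Java stores it as a surrogate pair (two chars); UTF-8 uses four bytes.
        System.out.println("chars:       " + money.length());
        System.out.println("code points: " + money.codePointCount(0, money.length()));
        System.out.println("UTF-8 bytes: " + money.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("non-BMP:     " + containsNonBmp(money));
    }
}
```

Strings made only of BMP characters never take the surrogate-pair path, so a check like this could be used to estimate how much existing data is exposed.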
   
   I would _assume_ that columns containing arbitrary input are never bucketed 
on, though.
   
   
   > but we need to create and(eq("col_bucket", 4), eq("col_bucket", 12)) 
instead to pick up data incorrectly placed in bucket 4
   
   That's easy to do today -- e.g. just call two different Guava APIs for 
Murmur3_32 to get numbers 4 and 12.
   Once Guava is changed -- or for any application not using Guava -- it would 
require copying the affected/buggy hashing algorithm and making it part of the 
fix.
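
For an application without Guava, the correct half of that pair is small enough to carry around. A minimal sketch, assuming the bucket rule from the Iceberg spec (Murmur3 x86 32-bit with seed 0 over the UTF-8 bytes, then `(hash & Integer.MAX_VALUE) % N`); the class name is hypothetical. It deliberately does not reproduce the buggy variant, which would have to be copied from an affected Guava release to compute the second bucket:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone helper; not part of Iceberg or Guava.
final class Murmur3BucketSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    /** Murmur3 x86 32-bit, seed 0, over the UTF-8 bytes of the value. */
    static int hash(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        int h1 = 0; // seed
        int tail = (data.length / 4) * 4;

        // Body: consume 4 bytes at a time as a little-endian int.
        for (int i = 0; i < tail; i += 4) {
            int k1 = (data[i] & 0xff)
                    | (data[i + 1] & 0xff) << 8
                    | (data[i + 2] & 0xff) << 16
                    | (data[i + 3] & 0xff) << 24;
            h1 ^= mix(k1);
            h1 = Integer.rotateLeft(h1, 13) * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes, if any.
        int rem = data.length & 3;
        int k1 = 0;
        if (rem >= 3) k1 ^= (data[tail + 2] & 0xff) << 16;
        if (rem >= 2) k1 ^= (data[tail + 1] & 0xff) << 8;
        if (rem >= 1) {
            k1 ^= data[tail] & 0xff;
            h1 ^= mix(k1);
        }

        // Finalization: fold in the length and avalanche the bits.
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    private static int mix(int k1) {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;
        return k1;
    }

    /** Bucket number in [0, numBuckets), following the spec's formula. */
    static int bucket(String value, int numBuckets) {
        return (hash(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        System.out.println(hash("iceberg"));
        System.out.println(bucket("iceberg", 16));
    }
}
```

The Iceberg spec lists 1210000089 as the hash of the string "iceberg", which makes a convenient sanity check for a copy like this. Because the input is encoded to bytes first, non-BMP characters take no special path here.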
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


