[GitHub] [iceberg] rdblue commented on issue #2837: Incorrect bucket value calculated for string with non-BMP characters

GitBox Wed, 21 Jul 2021 16:32:11 -0700


rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-884562950



   @findepi, thanks for the extra info about how widespread this is.
   
   > Could the string bucketing fix be considered part of v2 breaking changes?
   
   There's no problem with the spec here because the spec requires the hash 
value to be equivalent to hashing the UTF-8 bytes.
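   
   As a rough illustration of that requirement, here is a minimal sketch (not 
the reference implementation) of a 32-bit Murmur3 hash applied to the UTF-8 
bytes of a string, followed by the bucket calculation from the spec, 
`(hash & Integer.MAX_VALUE) % N`. The function names `murmur3_x86_32` and 
`bucket` are made up for this example. It also shows why non-BMP characters 
are a special case: they take 4 bytes in UTF-8 but two UTF-16 code units in a 
Java `String`.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k: int) -> int:
    """Per-block scramble used by Murmur3 x86 32-bit."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Murmur3 x86 32-bit hash of raw bytes, returned as an unsigned int."""
    h = seed & MASK32
    n = len(data)
    body_len = n & ~3  # length of the part made of whole 4-byte blocks

    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    tail = data[body_len:]  # 0 to 3 leftover bytes
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))

    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket(value: str, num_buckets: int) -> int:
    """Bucket a string by hashing its UTF-8 bytes, as the spec requires."""
    h = murmur3_x86_32(value.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    text = "a\U0001F600"  # 'a' followed by one non-BMP character
    print(len(text.encode("utf-8")))           # UTF-8 byte count
    print(len(text.encode("utf-16-le")) // 2)  # UTF-16 code unit count
    print(bucket(text, 16))
```

   Any implementation that hashes something other than those UTF-8 bytes for 
such a string will produce a different bucket, even though it agrees with the 
spec for BMP-only input.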
   
   The problem is with the Iceberg reference implementation, so we need to 
decide how to address that. We discussed this at our sync this morning, and the 
rough consensus was to notify users of the issue but not to implement the 
change to filter projection. There were two main reasons: (1) other operations 
also use bucketing (joins, for example), and bad partition values can't be 
fixed in those cases; and (2) supporting the workaround once Guava is fixed 
would be difficult and would require us to keep the buggy Guava code in Iceberg.
   
   Like you said, bucketing by arbitrary string inputs that contain these 
characters should be fairly rare. Someone might bucket by a UUID string, but 
those are predictable and don't use non-BMP characters.
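   
   For anyone who wants to check whether their bucketed string values could be 
affected, a small sketch like the following tests for characters outside the 
BMP (code points above U+FFFF). The helper name `has_non_bmp` is hypothetical 
and not part of Iceberg.

```python
import uuid


def has_non_bmp(value: str) -> bool:
    """Return True if the string contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in value)


if __name__ == "__main__":
    # A canonical UUID string is hex digits and hyphens only, so it is safe.
    print(has_non_bmp(str(uuid.uuid4())))
    # A string containing an emoji (U+1F600) is outside the BMP.
    print(has_non_bmp("user-\U0001F600"))
```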
   
   Our current plan is to fix the hash implementation and include the issue in 
the release notes. We're also working on a way to repartition data in an 
action, which should help people who are caught by this. In the meantime, we 
can tell people how to use MERGE INTO to fix the bad data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


