rdblue commented on issue #2837:
URL: https://github.com/apache/iceberg/issues/2837#issuecomment-884562950

@findepi, thanks for the extra info about how widespread this is.

> Could the string bucketing fix be considered part of v2 breaking changes?

There's no problem with the spec here, because the spec requires the hash value to be equivalent to hashing the UTF-8 bytes. The problem is with the Iceberg reference implementation, so we need to decide how to address that.

We discussed this at our sync this morning, and the rough consensus was to notify users that this is an issue but not to implement the change to filter projection. There were two main reasons: (1) there are other operations that use bucketing (joins), and bad partition values can't be fixed in those cases; and (2) supporting the work-around once Guava is fixed will be difficult and would require us to keep the buggy Guava code in Iceberg.

Like you said, bucketing by arbitrary string inputs with these characters should be fairly rare. Someone might bucket by a UUID string, but those are predictable and don't use non-BMP characters.

Our current plan is to fix the hash implementation and include the issue in the release notes. We're also working on a way to repartition data in an action, so that should help people who are caught by this. In the meantime, we can tell people how to use MERGE INTO to fix the bad data.
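
As a rough illustration of which string values are at risk, here is a minimal Python sketch. It assumes, as the comment above suggests, that only strings containing non-BMP characters (code points above U+FFFF) are affected. The helper names `has_non_bmp` and `utf8_bytes` are hypothetical and are not part of Iceberg or Guava.

```python
import uuid


def has_non_bmp(value: str) -> bool:
    """Return True if the string contains any code point above U+FFFF.

    Assumption: these are the only inputs for which the buggy hash
    can differ from a hash computed over the UTF-8 bytes.
    """
    return any(ord(ch) > 0xFFFF for ch in value)


def utf8_bytes(value: str) -> bytes:
    """The byte sequence the spec says the bucket hash must be equivalent to hashing."""
    return value.encode("utf-8")


if __name__ == "__main__":
    samples = [
        "iceberg",              # plain ASCII
        str(uuid.uuid4()),      # UUID strings use only hex digits and '-'
        "caf\u00e9",            # non-ASCII, but still inside the BMP
        "data-\U0001F600",      # contains a non-BMP character
    ]
    for s in samples:
        flag = "check" if has_non_bmp(s) else "ok"
        print(f"{flag:5} {len(utf8_bytes(s)):3d} bytes  {s!r}")
```

A check like this could be run over the distinct values of a string bucketing column to estimate how many rows might need to be rewritten, for example with MERGE INTO.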
