parisni commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1546907688
I dig a bit in the spark murmur 3 implementation. It is not standard at least of two reason: 1. they use a hardcoded seed = 42 (which likely would not be the same as hive) 2. [they claim their way of dealing with murmur is not standard](https://github.com/apache/spark/blob/b23185080cc3e5a00b88496cec70c2b3cd7019f5/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java#L67-L68) there is [an issue about this](https://issues.apache.org/jira/browse/SPARK-23381) and a other implementation (=hashUnsafeBytes2) exists, but it is not used so far. Then I am not sure we could use [guava murmur3 as is](https://guava.dev/releases/23.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32-int-) The spark3 implementation is based on catalyst expression while in hudi we work with java types. If we want to use their implementation we should import spark-unsafe as a dependency in the hudi-client-common. We could also copy their implementation within hudi and maintain it. However in both case we would have to convert basic java types into catalyst types to be able to re-use the spark implementation (see https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596). I am not sure it is a good design to introduce spark concepts within hudi-client-common -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
