parisni commented on PR #8657:
URL: https://github.com/apache/hudi/pull/8657#issuecomment-1546907688

   I dig a bit in the spark murmur 3 implementation. It is not standard at 
least of two reason:
   1. they use a hardcoded seed = 42 (which likely would not be the same as 
hive)
   2. [they claim their way of dealing with murmur is not 
standard](https://github.com/apache/spark/blob/b23185080cc3e5a00b88496cec70c2b3cd7019f5/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java#L67-L68)
 there is [an issue about 
this](https://issues.apache.org/jira/browse/SPARK-23381) and a other 
implementation (=hashUnsafeBytes2) exists, but it is not used so far.
   
   Then I am not sure we could use [guava murmur3 as 
is](https://guava.dev/releases/23.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32-int-)
   
   The spark3 implementation is based on catalyst expression while in hudi we 
work with java types. If we want to use their implementation we should import 
spark-unsafe as a dependency in the hudi-client-common. We could also copy 
their implementation within hudi and maintain it. However in both case we would 
have to convert basic java types into catalyst types to be able to re-use the 
spark implementation (see 
https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596).
 I am not sure it is a good design to introduce spark concepts within 
hudi-client-common


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to