[GitHub] [hudi] parisni commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client


parisni commented on PR #8657:
URL: https://github.com/apache/hudi/pull/8657#issuecomment-1546907688

I dig a bit in the spark murmur 3 implementation. It is not standard at
least of two reason:
1. they use a hardcoded seed = 42 (which likely would not be the same as
hive)
2. [they claim their way of dealing with murmur is not
standard](https://github.com/apache/spark/blob/b23185080cc3e5a00b88496cec70c2b3cd7019f5/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java#L67-L68)
there is [an issue about
this](https://issues.apache.org/jira/browse/SPARK-23381) and a other
implementation (=hashUnsafeBytes2) exists, but it is not used so far.

Then I am not sure we could use [guava murmur3 as
is](https://guava.dev/releases/23.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32-int-)

The spark3 implementation is based on catalyst expression while in hudi we
work with java types. If we want to use their implementation we should import
spark-unsafe as a dependency in the hudi-client-common. We could also copy
their implementation within hudi and maintain it. However in both case we would
have to convert basic java types into catalyst types to be able to re-use the
spark implementation (see
https://github.com/apache/spark/blob/v3.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L523-L596).
I am not sure it is a good design to introduce spark concepts within
hudi-client-common

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to