GitHub user jacques-n commented on a diff in the pull request:
https://github.com/apache/drill/pull/430#discussion_r56925560
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/HashHelper.java
---
@@ -17,47 +17,77 @@
*/
package org.apache.drill.exec.expr.fn.impl;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.common.exceptions.DrillConfigurationException;
+
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
-public class HashHelper {
+public abstract class HashHelper {
static final org.slf4j.Logger logger =
org.slf4j.LoggerFactory.getLogger(HashHelper.class);
+ public static final String defaultHashClassName = new
String("org.apache.drill.exec.expr.fn.impl.MurmurHash3");
+ static final String HASH_CLASS_PROP = "drill.exec.hash.class";
+ static String actualHashClassName = defaultHashClassName;
+ static DrillHash hashCall = new MurmurHash3();
+ static {
- /** taken from mahout **/
- public static int hash(ByteBuffer buf, int seed) {
- // save byte order for later restoration
-
- int m = 0x5bd1e995;
- int r = 24;
+ try {
+ DrillConfig config = DrillConfig.create();
+ String configuredClassName = config.getString(HASH_CLASS_PROP);
+ if(configuredClassName != null && configuredClassName != "") {
+ actualHashClassName = configuredClassName;
+ hashCall = config.getInstanceOf(HASH_CLASS_PROP, DrillHash.class);
+ }
+ logger.debug("HashHelper initializes with " + actualHashClassName);
+ }
+ catch(Exception ex){
+ logger.error("Could not initialize Hash %s", ex.getMessage());
+ }
+ }
- int h = seed ^ buf.remaining();
+ public static String getHashClassName(){
+ return actualHashClassName;
+ }
- while (buf.remaining() >= 4) {
- int k = buf.getInt();
+ public static int hash32(int val, long seed) {
+ double converted = val;
+ return hash32(converted, seed);
+ }
+ public static int hash32(long val, long seed) {
+ double converted = val;
+ return hash32(converted, seed);
+ }
+ public static int hash32(float val, long seed){
+ double converted = val;
+ return hash32(converted, seed);
+ }
- k *= m;
- k ^= k >>> r;
- k *= m;
+ public static int hash32(double val, long seed){
+ return hashCall.hash32(val, seed);
+ }
- h *= m;
- h ^= k;
- }
+ public static int hash32(int start, int end, DrillBuf buffer, int seed){
+ return hashCall.hash32(start, end, buffer, seed);
--- End diff --
Yes, I'm worried about the extra performance hit. I believe we already
spend a reasonable amount of processing time applying hash functions and have
considered it an opportunity for improvement. Given your current construction,
we would need to dereference the hashCall field every time we call the hash
function. In the past, my analysis of the assembly emitted by the JVM has shown
that this dereference isn't typically optimized away. Binding directly to a
static function doesn't incur this overhead. Take a look at the JVM bytecode
(or assembly) to see the difference. In general, our goal inside individual
functions is to avoid indirection as much as possible, especially on a hot
path such as the hash function.
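
To make the comparison concrete, here is a minimal sketch of the two call shapes. It is not Drill code: `Hash32`, `MixHash`, and `HashDispatchSketch` are hypothetical names, and the mixing steps (borrowed from MurmurHash3's 64-bit finalizer) are only a stand-in for a real hash. The point is the call site: one path loads a static field and dispatches through an interface, the other binds to a static method.

```java
// Hypothetical stand-in for a pluggable hash interface (not Drill's DrillHash).
interface Hash32 {
    int hash32(long val, int seed);
}

// Hypothetical implementation; the mixing steps are only a placeholder hash.
final class MixHash implements Hash32 {
    static int mix(long val, int seed) {
        long h = val ^ seed;
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return (int) h;
    }

    @Override
    public int hash32(long val, int seed) {
        return mix(val, seed);
    }
}

final class HashDispatchSketch {
    // Non-final static field, as in the diff: the target can change at runtime.
    static Hash32 hashCall = new MixHash();

    // Compiles to getstatic (load the field) followed by invokeinterface.
    static int viaField(long val, int seed) {
        return hashCall.hash32(val, seed);
    }

    // Compiles to a single invokestatic; no field load, no interface dispatch.
    static int direct(long val, int seed) {
        return MixHash.mix(val, seed);
    }

    public static void main(String[] args) {
        // Both paths compute the same value; only the call shape differs.
        System.out.println(viaField(42L, 0) == direct(42L, 0));
    }
}
```

Running `javap -c HashDispatchSketch` on the compiled class shows the bytecode difference between `viaField` and `direct`. Whether the JIT removes the extra load and dispatch in a given workload would need to be checked in the generated assembly or with a benchmark.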
---