weijietong commented on issue #1662: DRILL-6825: apply different hash 
algorithms to different data types
URL: https://github.com/apache/drill/pull/1662#issuecomment-469236472
 
 
   The IntegerHashing method is also used in ClickHouse for integer types 
(see the intHash32 method in 
https://github.com/yandex/ClickHouse/blob/master/dbms/src/Common/HashTable/Hash.h).
   ClickHouse chooses its hashing method carefully according to the data type 
and key width, which is worth learning from. As you mentioned, Murmur3Hash 
does not perform well on short integer keys, so it is better to use 
IntegerHash for integer keys.
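
   To illustrate what a cheap integer-specific hash looks like, here is a 
minimal sketch. It is not ClickHouse's intHash32 and not this PR's 
IntegerHashing code: the class name `IntKeyHash` is hypothetical, and the 
mixing steps are the well-known Murmur3 32-bit finalizer (fmix32), used here 
only as a stand-in for a short shift/multiply mixer that avoids the generic 
byte-oriented Murmur3 path.

```java
// Hypothetical helper, for illustration only.
final class IntKeyHash {
    private IntKeyHash() {}

    /** Mixes a 32-bit key with the Murmur3 32-bit finalizer (fmix32). */
    static int hash32(int key) {
        int h = key;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 16;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Folds a 64-bit key down to 32 bits, then applies the same mixer. */
    static int hash32(long key) {
        return hash32((int) (key ^ (key >>> 32)));
    }
}
```

   The whole hash is a handful of register operations with no loop over 
bytes, which is the property that matters for fixed-width integer keys.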
   
   I had already read the Boost implementation discussion you mentioned. But 
I think it is reasonable that Boost, as a base library, still keeps its 
current implementation. 
   
   The reason to keep the seed out of the hash32 function and to bring in 
Boost's hash_combine method is that I want to change the current hashing 
strategy later. For the multi-key case, I plan to change the 
hash32(hash32(hash32)) row-iterate model to a 
`hash32() hash_combine hash32() hash_combine hash32()` column-combine model. 
The row-iterate model has a data dependency between successive hash calls, 
which hurts CPU pipeline performance.
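
   A minimal sketch of the two models, under stated assumptions: the class 
`MultiKeyHash` and its method names are hypothetical and are not the PR's 
code, the per-value `hash32` is a stand-in (the Murmur3 32-bit finalizer), 
the seeded variant simply XORs the seed into the key, and `hashCombine` is a 
32-bit port of the classic Boost formula 
`seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2)`.

```java
// Hypothetical helper, for illustration only.
final class MultiKeyHash {
    private MultiKeyHash() {}

    /** Stand-in per-value hash: the Murmur3 32-bit finalizer. */
    static int hash32(int v) {
        int h = v;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 16;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Seeded variant (assumption: the seed is XORed into the key). */
    static int hash32(int v, int seed) {
        return hash32(v ^ seed);
    }

    /** 32-bit port of the classic Boost hash_combine formula. */
    static int hashCombine(int seed, int h) {
        return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >>> 2));
    }

    /** Row-iterate model: each hash call is seeded with the previous result. */
    static int[] rowIterate(int[][] columns, int rows) {
        int[] out = new int[rows];
        for (int r = 0; r < rows; r++) {
            int h = 0;
            for (int[] col : columns) {
                h = hash32(col[r], h); // must wait for the previous hash32
            }
            out[r] = h;
        }
        return out;
    }

    /** Column-combine model: unseeded hashes, merged with hashCombine. */
    static int[] columnCombine(int[][] columns, int rows) {
        int[] out = new int[rows];
        for (int[] col : columns) {
            for (int r = 0; r < rows; r++) {
                // hash32(col[r]) does not depend on out[r]; only the cheap
                // combine step does.
                out[r] = hashCombine(out[r], hash32(col[r]));
            }
        }
        return out;
    }
}
```

   In the column-combine version the expensive hash of each value depends 
only on that value, so the hashes of one column can be computed in a tight 
loop independently of the other columns; only the short combine step touches 
the running result.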
   
   Other hashing methods I know of can be found at 
https://github.com/benalexau/hash-bench, a collection of Java hashing 
methods. The benchmark I ran showed that city_1_1 from 
https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java
 performs best at key widths of 32 and 64 bytes.
   
   I also wonder whether we can do the join keys data type implication at the 
project node later, so that the HashJoin and Exchange nodes can also benefit 
from this PR.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
