Ben-Zvi commented on issue #1662: DRILL-6825: apply different hash algorithms to different data types
URL: https://github.com/apache/drill/pull/1662#issuecomment-468894618

Adding the hash32() method to the ValueVector is useful; however, picking algorithms just because they come from a paper or are famous may not be good enough. At my previous employer I evaluated many hash functions by actually running stand-alone performance and distribution tests. One clear result back then was that murmur performed well on long strings, but less well on shorter data.

How do the new hash functions in IntegerHashing compare with the existing one in HashHelper?

The Boost implementation of hash_combine looks "fishy" (e.g., some bits get used more than others); see some more critique at https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values

Why can't the seed be given directly to the hash function instead of being "combined" later?

Another good hash function used in the past (I don't recall its name) worked with a map of 256 prime numbers. Starting with the seed, the code used each input byte as an index into the map: rotate the old value, XOR it with the newly mapped value, and continue with the next byte.

Things may perform differently in Java, though. Also, do you know of any open source hash functions we could just import instead of writing the code in Drill?
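To make the table-driven function mentioned above concrete, here is a rough Java sketch of the rotate-and-XOR loop, with the seed passed directly as the initial value. The class name, the rotation amount (5 bits), and the way the 256 primes are chosen are all assumptions for illustration; the original function's table and constants are not known here, and this has not been benchmarked.

```java
import java.nio.charset.StandardCharsets;

final class TableHashSketch {
  // Hypothetical table: 256 distinct primes spread across the positive int
  // range. A real implementation would ship a fixed, pre-vetted table.
  static final int[] PRIMES = buildPrimeTable();

  private static int[] buildPrimeTable() {
    int[] table = new int[256];
    for (int i = 0; i < 256; i++) {
      // Start each search 8,000,000 apart so the entries differ in high bits.
      int candidate = (i + 1) * 8_000_000 + 1;
      while (!isPrime(candidate)) {
        candidate += 2;
      }
      table[i] = candidate;
    }
    return table;
  }

  // Simple trial division; only used once, while building the table.
  private static boolean isPrime(int n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (long d = 3; d * d <= n; d += 2) {
      if (n % d == 0) return false;
    }
    return true;
  }

  // The seed is the starting value; each input byte selects a table entry.
  static int hash32(byte[] data, int seed) {
    int h = seed;
    for (byte b : data) {
      // Rotate the old value, then XOR with the mapped prime.
      // The rotation amount of 5 is an assumed constant.
      h = Integer.rotateLeft(h, 5) ^ PRIMES[b & 0xFF];
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] key = "drill".getBytes(StandardCharsets.UTF_8);
    System.out.println(Integer.toHexString(hash32(key, 0)));
    System.out.println(Integer.toHexString(hash32(key, 1)));
  }
}
```

Because the rotate and the XOR are both invertible, two different seeds always give different results for the same input, so no separate combine step is needed in this sketch. Whether its distribution and speed are acceptable would still have to be measured against HashHelper.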
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
