[ https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726299#comment-16726299 ]
Boaz Ben-Zvi commented on DRILL-6825:
-------------------------------------

[~weijie] - When iterating, each data type vector may use a different hash function. Indeed, for variable-sized types (usually VARCHAR) a given hash function may not perform best if the value is long; however, as these values are used as (join/aggr) keys, they are typically of a reasonable size (e.g. <= 16). If some users insist on using long keys, they deserve poor performance :)

We could also have a collection of hash functions, and use some configuration map to assign each type its hash function.

The suggestion to extract all the key columns into a temporary buffer and then apply a single function over this buffer also has costs, like the copy and the inflexibility of using the same hash function for all.

Here is an example of a type-specific hash function: for TIMESTAMP, take the YYYYMMDD part and XOR it with the seed, then perform a (slower) hash on each byte of the microseconds part (the latter part usually has more entropy).

> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on the data type
> and data size. We should choose the right one to apply, not just
> Murmurhash.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
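
For illustration, the TIMESTAMP idea in the comment above could be sketched in Java roughly as follows. This is only a sketch under stated assumptions, not Drill code: the class name `TimestampHashSketch`, the representation of the timestamp as microseconds since the Unix epoch in UTC, and the use of an FNV-1a-style byte mix as the "slower" per-byte hash are all choices made here for the example. The comment does not say which per-byte hash is intended.

```java
import java.time.LocalDate;

// Hypothetical sketch: not part of Apache Drill.
final class TimestampHashSketch {
    private static final long MICROS_PER_DAY = 86_400_000_000L;
    // 32-bit FNV prime; using an FNV-1a-style step is an assumption of this sketch.
    private static final int FNV_PRIME = 0x01000193;

    private TimestampHashSketch() {}

    /** Hashes a timestamp given as microseconds since the Unix epoch (UTC assumed). */
    static int hash(long epochMicros, int seed) {
        // Split into a calendar day and the microseconds within that day.
        long epochDay = Math.floorDiv(epochMicros, MICROS_PER_DAY);
        long microsOfDay = Math.floorMod(epochMicros, MICROS_PER_DAY);

        // Cheap part: encode the date as the integer YYYYMMDD and XOR it with the seed.
        LocalDate date = LocalDate.ofEpochDay(epochDay);
        int yyyymmdd = date.getYear() * 10_000
                + date.getMonthValue() * 100
                + date.getDayOfMonth();
        int h = yyyymmdd ^ seed;

        // Slower part: mix in the microseconds one byte at a time.
        // microsOfDay is below 2^37, so five bytes cover every possible value.
        for (int shift = 0; shift < 40; shift += 8) {
            h ^= (int) ((microsOfDay >>> shift) & 0xFF);
            h *= FNV_PRIME;
        }
        return h;
    }
}
```

A call such as `TimestampHashSketch.hash(epochMicros, seed)` returns a 32-bit value. Timestamps on the same day share the cheap date step, so they differ only through the per-byte mixing of the microseconds part, which is where the comment expects most of the entropy to be.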