[ 
https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726299#comment-16726299
 ] 

Boaz Ben-Zvi commented on DRILL-6825:
-------------------------------------

[~weijie] - When iterating over the key columns, each data type's vector may 
use a different hash function. Indeed, for variable-sized types (usually 
VARCHAR) a given hash function may not perform best if the value is long; 
however, as these values are used as (join/aggregation) keys, they are 
typically of a reasonable size (e.g. <= 16). 
If some users insist on using long keys, they deserve poor performance :) 

We could also have a collection of hash functions, and use some configuration 
map to assign each type its hash function.
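
A minimal sketch of such a map, for illustration only. Every name here 
(TypeHashRegistry, KeyType, HashFn, configure, hashKey) is hypothetical and 
not an existing Drill API, and the two hash functions are simple stand-ins 
rather than recommended choices.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Drill APIs.
class TypeHashRegistry {
  /** Illustrative type tags standing in for the engine's real type enum. */
  enum KeyType { INT, BIGINT, VARCHAR }

  /** Hashes the bytes data[start, end) combined with a seed. */
  interface HashFn {
    int hash(byte[] data, int start, int end, int seed);
  }

  /** FNV-1a style byte-at-a-time hash; the seed is folded into the basis. */
  static int fnv1a(byte[] data, int start, int end, int seed) {
    int h = 0x811C9DC5 ^ seed;
    for (int i = start; i < end; i++) {
      h ^= data[i] & 0xFF;
      h *= 0x01000193;
    }
    return h;
  }

  /** Cheap polynomial hash, used here as the stand-in for short fixed-width keys. */
  static int poly31(byte[] data, int start, int end, int seed) {
    int h = seed;
    for (int i = start; i < end; i++) {
      h = 31 * h + (data[i] & 0xFF);
    }
    return h;
  }

  // The "configuration map": one hash function per key type.
  private static final Map<KeyType, HashFn> BY_TYPE = new EnumMap<>(KeyType.class);
  static {
    BY_TYPE.put(KeyType.INT, TypeHashRegistry::poly31);
    BY_TYPE.put(KeyType.BIGINT, TypeHashRegistry::poly31);
    BY_TYPE.put(KeyType.VARCHAR, TypeHashRegistry::fnv1a);
  }

  /** Overrides the function assigned to a type, e.g. from a config option. */
  static void configure(KeyType type, HashFn fn) {
    BY_TYPE.put(type, fn);
  }

  /** Looks up the function assigned to the type and applies it to the key bytes. */
  static int hashKey(KeyType type, byte[] data, int seed) {
    HashFn fn = BY_TYPE.get(type);
    if (fn == null) {
      throw new IllegalArgumentException("no hash function for " + type);
    }
    return fn.hash(data, 0, data.length, seed);
  }
}
```

In generated code the lookup would presumably be resolved once per key column 
at setup time, not once per row.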

The suggestion to extract all the key columns into a temporary buffer and then 
apply a single function over this buffer also has costs, such as the extra copy 
and the inflexibility of using the same hash function for all types.
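
To make the trade-off concrete, here is a rough sketch of the two styles for a 
two-column (INT, BIGINT) key. All names are hypothetical, the mixing function 
is only a Murmur3-style finalizer used as a placeholder, and chaining the 
per-column result in as the next column's seed is an assumption about how the 
per-column variant would combine values.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not Drill code.
class KeyHashingStyles {
  /** Murmur3-style 32-bit finalizer mix applied to (x XOR seed). */
  static int hashInt(int x, int seed) {
    int h = x ^ seed;
    h ^= h >>> 16;
    h *= 0x85EBCA6B;
    h ^= h >>> 13;
    h *= 0xC2B2AE35;
    h ^= h >>> 16;
    return h;
  }

  /** Folds the 64-bit value to 32 bits, then reuses hashInt. */
  static int hashLong(long x, int seed) {
    return hashInt((int) (x ^ (x >>> 32)), seed);
  }

  /** Generic byte-at-a-time hash (FNV-1a style) over the first len bytes. */
  static int hashBytes(byte[] data, int len, int seed) {
    int h = 0x811C9DC5 ^ seed;
    for (int i = 0; i < len; i++) {
      h ^= data[i] & 0xFF;
      h *= 0x01000193;
    }
    return h;
  }

  /** Style A: copy every key column into a scratch buffer, then hash it once. */
  static int hashViaBuffer(int intKey, long longKey, ByteBuffer scratch, int seed) {
    scratch.clear();                          // reuse the same heap buffer per row
    scratch.putInt(intKey).putLong(longKey);  // this copy is the extra cost
    return hashBytes(scratch.array(), scratch.position(), seed);
  }

  /** Style B: hash each column with its own function, chaining the seed. */
  static int hashPerColumn(int intKey, long longKey, int seed) {
    int h = hashInt(intKey, seed);  // function chosen for INT
    return hashLong(longKey, h);    // function chosen for BIGINT
  }
}
```

Style A needs a heap-backed buffer (for example ByteBuffer.allocate(12)) and 
pays for the copy on every row; style B avoids the copy and lets each column 
keep a function suited to its type.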

Here is an example of a type-specific hash function: for TIMESTAMP, take the 
YYYYMMDD part and XOR it with the seed, then perform a (slower) hash on each 
byte of the microseconds part (the latter part usually has more entropy).
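
A rough sketch of that idea, under stated assumptions: the timestamp is taken 
to be a count of microseconds since the epoch, the day number stands in for 
the YYYYMMDD part, and the per-byte step is an FNV-1a style multiply. The 
class and method names are hypothetical, and this is not how Drill hashes 
TIMESTAMP today.

```java
// Hypothetical sketch; not Drill code.
class TimestampHash {
  static final long MICROS_PER_DAY = 86_400_000_000L;

  /**
   * Cheap XOR for the low-entropy date part, then a slower per-byte hash
   * for the higher-entropy time-of-day part.
   */
  static int hash(long epochMicros, int seed) {
    // floorDiv/floorMod keep the split consistent for pre-epoch values.
    long day = Math.floorDiv(epochMicros, MICROS_PER_DAY);
    long microsOfDay = Math.floorMod(epochMicros, MICROS_PER_DAY);

    // Date part: a single XOR with the seed.
    int h = (int) day ^ seed;

    // Time part: microsOfDay is below 2^37, so its low five bytes cover it.
    for (int shift = 0; shift < 40; shift += 8) {
      h ^= (int) ((microsOfDay >>> shift) & 0xFF);
      h *= 0x01000193;
    }
    return h;
  }
}
```

Two timestamps on the same day share the cheap date step and differ only in 
the per-byte loop, which is where most of the mixing effort is spent.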

> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on the data type and 
> data size. We should choose the right one to apply, not just use 
> MurmurHash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
