[
https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726299#comment-16726299
]
Boaz Ben-Zvi commented on DRILL-6825:
-------------------------------------
[~weijie] - When iterating, each data type vector may use a different hash
function. Indeed, for variable-sized types (usually VARCHAR) a given hash
function may not perform best when the value is long; however, since these
values are used as (join/aggr) keys, they are typically of a reasonable size
(e.g. <= 16). If some users insist on using long keys, they deserve poor
performance :)
We could also have a collection of hash functions, and use some configuration
map to assign each type its hash function.
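
A rough sketch of what such a configuration map could look like. Everything
here is hypothetical (TypeHashRegistry, KeyType, KeyHasher and the two sample
hash functions are illustration only, not Drill classes), and the choice of
FNV-1a as the fallback is just an assumption for the example:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: maps each key type to its own hash function.
final class TypeHashRegistry {
    enum KeyType { INT, BIGINT, VARCHAR }

    /** Hashes the raw bytes of one key value, mixing in a seed. */
    interface KeyHasher {
        int hash(byte[] value, int seed);
    }

    private final Map<KeyType, KeyHasher> byType = new EnumMap<>(KeyType.class);
    private final KeyHasher fallback;

    TypeHashRegistry(KeyHasher fallback) {
        this.fallback = fallback;
    }

    /** Assigns a hash function to a type (e.g. driven by a config option). */
    void assign(KeyType type, KeyHasher hasher) {
        byType.put(type, hasher);
    }

    /** Returns the hasher for the type, or the fallback if none was assigned. */
    KeyHasher forType(KeyType type) {
        return byType.getOrDefault(type, fallback);
    }

    // FNV-1a (32-bit), byte at a time; used here as a general-purpose default.
    static int fnv1a(byte[] value, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : value) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Cheaper mix intended for short fixed-width values: rotate-and-xor the
    // bytes into an int, then multiply by an odd constant.
    static int foldAndMix(byte[] value, int seed) {
        int h = seed;
        for (byte b : value) {
            h = (h << 8 | h >>> 24) ^ (b & 0xFF);
        }
        return h * 0x9E3779B1;
    }
}
```

Usage would be along the lines of
`registry.assign(KeyType.INT, TypeHashRegistry::foldAndMix)` at setup time and
`registry.forType(type).hash(bytes, seed)` per value; in generated code the
lookup would presumably be resolved once per column rather than per row.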
The suggestion to extract all the key columns into a temporary buffer and then
apply a single function over this buffer also has costs, like the copy and the
inflexibility of using the same hash function for all.
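
To make the trade-off concrete, here is a minimal sketch of the two approaches
for a two-column key (one INT, one VARCHAR). All names are hypothetical, FNV-1a
stands in for whatever byte-wise hash would really be used, and chaining the
first column's hash as the seed of the next is an assumption of this example:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch contrasting "copy then hash once" with per-column hashing.
final class KeyHashingStrategies {
    // Byte-at-a-time FNV-1a over the first `length` bytes of `data`.
    static int hashBytes(byte[] data, int length, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (int i = 0; i < length; i++) {
            h ^= data[i] & 0xFF;
            h *= 0x01000193;
        }
        return h;
    }

    // Approach A: copy every key column into a temporary buffer, then apply
    // one function over it. Pays for the allocation and copy, and every
    // column gets the same byte-wise hash.
    static int hashViaBuffer(int intKey, byte[] varcharKey, int seed) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + varcharKey.length);
        buf.putInt(intKey).put(varcharKey);
        return hashBytes(buf.array(), buf.position(), seed);
    }

    // Approach B: hash each column with a function suited to its type and
    // feed the result into the next column as its seed. No copy is needed.
    static int hashPerColumn(int intKey, byte[] varcharKey, int seed) {
        int h = (intKey ^ seed) * 0x9E3779B1; // cheap mix for a fixed-width value
        return hashBytes(varcharKey, varcharKey.length, h);
    }
}
```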
Here is an example of a type-specific hash function for TIMESTAMP: take the
YYYYMMDD part and XOR it with the seed, then perform a (slower) hash on each
byte of the microseconds part (the latter part usually has more entropy).
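
One possible reading of that TIMESTAMP idea, as a sketch only. It assumes the
value arrives as microseconds since the Unix epoch in UTC and that "the
microseconds part" means the time-of-day remainder; the class name and the use
of FNV-1a as the slower per-byte hash are likewise assumptions:

```java
import java.time.LocalDate;

// Hypothetical sketch of a TIMESTAMP-specific hash.
final class TimestampHash {
    private static final long MICROS_PER_DAY = 86_400_000_000L;
    private static final int FNV_PRIME = 0x01000193;

    /** Date part of the timestamp encoded as the integer YYYYMMDD. */
    static int yyyymmdd(long epochMicros) {
        // floorDiv keeps pre-1970 timestamps on the correct calendar day.
        LocalDate date = LocalDate.ofEpochDay(Math.floorDiv(epochMicros, MICROS_PER_DAY));
        return date.getYear() * 10_000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    static int hash(long epochMicros, int seed) {
        // Cheap step: the low-entropy date part is only XORed with the seed.
        int h = yyyymmdd(epochMicros) ^ seed;

        // Slower step: per-byte FNV-1a rounds over the microseconds within
        // the day. That value is below 2^37, so five bytes cover it.
        long microsOfDay = Math.floorMod(epochMicros, MICROS_PER_DAY);
        for (int i = 0; i < 5; i++) {
            h ^= (int) (microsOfDay >>> (8 * i)) & 0xFF;
            h *= FNV_PRIME;
        }
        return h;
    }
}
```

Whether this actually beats a general-purpose hash on real data would need to
be measured; the sketch only shows the shape of the split.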
> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
> Key: DRILL-6825
> URL: https://issues.apache.org/jira/browse/DRILL-6825
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Codegen
> Reporter: weijie.tong
> Assignee: weijie.tong
> Priority: Major
> Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on data type and
> data size. We should choose the right one for each case rather than always
> applying Murmurhash.