[ 
https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725571#comment-16725571
 ] 

weijie.tong commented on DRILL-6825:
------------------------------------

[~ben-zvi]  Some thoughts about this issue.
The suggestion to give ValueVector a hash method is not enough on its own to 
solve the performance problem.
If there is only one key column, the suggestion is good, since different hash 
functions perform differently on different data types.
If there is more than one key column, then iterating over the ValueVectors and 
invoking a different hash function on each one may not perform better, and the 
result may not be right. Some hash functions, such as XXHash, perform well on 
large inputs. Maybe we could copy the values from the different key columns 
into one larger row buffer, then invoke XXHash once on the constructed row to 
get the hash value. That is, for each row we hash once rather than several 
times as before. For this, ValueVector could perhaps have a maxByteSize method 
that reports the maximum byte width over all of its rows. We could then 
allocate one fixed-size buffer ahead of time to hold the data from all the key 
columns, which saves memory allocation cost while copying the data.
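
A minimal sketch of the row-copy idea above, in Java. All names here 
(KeyColumn, maxByteSize, copyTo, hashRows) are hypothetical and are not 
existing Drill API. FNV-1a is used only as a stand-in for XXHash so that the 
sketch stays self-contained; the real choice of function is what this issue is 
about.

```java
import java.nio.ByteBuffer;

final class RowHashSketch {

    // Hypothetical view of a key column; not the real ValueVector interface.
    interface KeyColumn {
        int maxByteSize();                          // widest encoded value in this column
        void copyTo(int rowIndex, ByteBuffer dest); // append the encoded value of one row
    }

    // Fixed-width column: every value takes 4 bytes.
    static final class IntColumn implements KeyColumn {
        private final int[] values;
        IntColumn(int[] values) { this.values = values; }
        public int maxByteSize() { return Integer.BYTES; }
        public void copyTo(int rowIndex, ByteBuffer dest) { dest.putInt(values[rowIndex]); }
    }

    // Variable-width column: a length prefix keeps ("ab","c") distinct from ("a","bc").
    static final class BytesColumn implements KeyColumn {
        private final byte[][] values;
        private final int maxLength;
        BytesColumn(byte[][] values) {
            this.values = values;
            int max = 0;
            for (byte[] v : values) {
                max = Math.max(max, v.length);
            }
            this.maxLength = max;
        }
        public int maxByteSize() { return Integer.BYTES + maxLength; }
        public void copyTo(int rowIndex, ByteBuffer dest) {
            dest.putInt(values[rowIndex].length);
            dest.put(values[rowIndex]);
        }
    }

    // Stand-in hash (64-bit FNV-1a); an XXHash implementation would go here.
    static long hashBytes(byte[] data, int length) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < length; i++) {
            hash ^= (data[i] & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // One hash call per row over the concatenated key bytes.
    static long[] hashRows(KeyColumn[] keys, int rowCount) {
        int rowWidth = 0;
        for (KeyColumn key : keys) {
            rowWidth += key.maxByteSize();
        }
        ByteBuffer row = ByteBuffer.allocate(rowWidth); // allocated once, reused per row
        long[] hashes = new long[rowCount];
        for (int r = 0; r < rowCount; r++) {
            row.clear();
            for (KeyColumn key : keys) {
                key.copyTo(r, row);
            }
            hashes[r] = hashBytes(row.array(), row.position());
        }
        return hashes;
    }
}
```

The buffer is sized from the sum of the per-column maxByteSize values, so no 
allocation happens inside the row loop. Whether the copy cost is paid back by 
the single hash call would still need to be measured.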

In summary, for a single key column we look at the data type of the 
ValueVector to choose a suitable hash function; for more than one key column 
we look at the key data size to choose a suitable hash function.







> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on the data type and 
> data size. We should choose the right one to apply, not just always use 
> Murmurhash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
