[ 
https://issues.apache.org/jira/browse/DRILL-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276478#comment-14276478
 ] 

Aman Sinha commented on DRILL-1993:
-----------------------------------

+1.  
Since the commit message seems verbose it would be good to move some of that 
information as comments in the code.... 

> Fix allocation issues in HashTable and HashAgg to reduce memory waste
> ---------------------------------------------------------------------
>
>                 Key: DRILL-1993
>                 URL: https://issues.apache.org/jira/browse/DRILL-1993
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Operators
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>             Fix For: 0.8.0
>
>         Attachments: DRILL-1993-1.patch, DRILL-1993-2.patch
>
>
> Issues found:
> + Key container allocation issue in HashTable
> Currently we allocate 2^16 records capacity memory for "hashValues" and 
> "links" vectors in each BatchHolder, but for "key" holders we use the 
> allocateNew which by default allocates low capacity memory compared to 
> "hashValues" and "links" vector capacities (incase of Integer key, capacity 
> is 4096 records). This causes "key" holders to fill up much sooner even 
> though "hashValues" and "links" vectors still have lot of free entries. As a 
> result we create more BatchHolders than required causing wasted space in 
> "links" and "hashValues" vectors in each BatchHolder. And for each new 
> BatchHolder we create a SV4 vector in HashJoinHelper which is another 
> overhead.
> + Allocation issues in HashAggTemplate
> HashAggTemplate has its own BatchHolders which has vectors allocated using 
> allocateNew (i.e small capacity). Whenever a BatchHolder in HashAggTemplate 
> reaches its capacity, we add a new BatchHolder in HashTable. This causes the 
> HashTable BatchHolders to be not space efficient.
> + Update the HashAggTemplate.outputCurrentBatch to consider cases where all 
> entries in a single BatchHolder are can't be copied over to output vectors in 
> a single pass (output vectors capacity is lower than the number of records in 
> BatchHolder)
> + Lazy BatchHolder creation for both HashAgg and HashTable
> Don't allocate the BatchHolder until first put request is received. This way 
> we don't waste space in fragments which don't receive any input records. This 
> is possible when the group by key has very few distinct values (such as 
> shipmode in TPCH) only few fragments receive the data. Our current 
> parallelization code is not considering the distinct values when 
> parallelizing hash exchanges.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to